Quantum relative entropy training of boltzmann machines

ABSTRACT

Methods to train a quantum Boltzmann machine (QBM) having one or more visible nodes and one or more hidden nodes. The methods comprise associating each visible and each hidden node of the QBM to a different corresponding qubit of a plurality of qubits of a quantum computer, wherein a state of each of the plurality of qubits contributes to a global energy of the QBM according to a set of weighting factors, and wherein the plurality of qubits include one or more output qubits corresponding to one or more visible nodes of the QBM. The methods further comprise providing a distribution of training data over the one or more output qubits, estimating a gradient of a quantum relative entropy between the output qubits and the distribution of training data, and training the set of weighting factors based on the estimated gradient using the quantum relative entropy as a cost function.

BACKGROUND

A quantum computer is a physical machine configured to execute logical operations based on or influenced by quantum-mechanical phenomena. Such logical operations may include, for example, mathematical computation. Current interest in quantum-computer technology is motivated by theoretical analysis suggesting that the computational efficiency of an appropriately configured quantum computer may surpass that of any practicable non-quantum computer when applied to certain types of problems. Such problems include, for example, integer factorization, data searching, computer modeling of quantum phenomena, function optimization including machine learning, and solution of systems of linear equations. Moreover, it has been predicted that continued miniaturization of conventional computer logic structures will ultimately lead to the development of nanoscale logic components that exhibit quantum effects, and must therefore be addressed according to quantum-computing principles.

In the application of quantum computers to neural networks, various challenges persist as to the manner in which a quantum neural network may be trained to a desired task. In classical neural-network training, parametric weights and thresholds are optimized according to the gradient of a cost function evaluated over the domain of neuron states. For quantum neural networks, however, the gradient of the cost function may be difficult to estimate due to operator non-commutivity and other analytical complexities.

SUMMARY

This disclosure describes, inter alia, methods to train a quantum Boltzmann machine (QBM) having one or more visible nodes and one or more hidden nodes. The methods comprise associating each visible and each hidden node of the QBM to a different corresponding qubit of a plurality of qubits of a quantum computer, wherein a state of each of the plurality of qubits contributes to a global energy of the QBM according to a set of weighting factors, and wherein the plurality of qubits include one or more output qubits corresponding to one or more visible nodes of the QBM. The methods further comprise providing a distribution of training data over the one or more output qubits, estimating a gradient of a quantum relative entropy between the one or more output qubits and the distribution of training data, and training the set of weighting factors based on the estimated gradient, using the quantum relative entropy as a cost function.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows aspects of an example quantum computer.

FIG. 2 illustrates a Bloch sphere, which graphically represents the quantum state of one qubit of a quantum computer.

FIG. 3 shows aspects of an example signal waveform for effecting a quantum-gate operation in a quantum computer.

FIG. 4A shows aspects of an example Boltzmann machine.

FIG. 4B shows aspects of an example restricted Boltzmann machine.

FIG. 5 illustrates an example method to train a quantum Boltzmann machine having visible and hidden nodes.

FIGS. 6A and 6B illustrate an example method to estimate the gradient of the quantum relative entropy of a restricted quantum Boltzmann machine having visible and hidden nodes.

FIG. 7 illustrates an example method to estimate the gradient of the quantum relative entropy of a restricted or non-restricted quantum Boltzmann machine.

DETAILED DESCRIPTION

FIG. 1 shows aspects of an example quantum computer 10 configured to execute quantum-logic operations (vide infra). Whereas conventional computer memory holds digital data in an array of bits and enacts bit-wise logic operations, a quantum computer holds data in an array of qubits and operates quantum-mechanically on the qubits in order to implement the desired logic. Accordingly, quantum computer 10 of FIG. 1 includes at least one qubit register 12 comprising an array of qubits 14. The illustrated qubit register is eight qubits in length; qubit registers comprising longer and shorter qubit arrays are also envisaged, as are quantum computers comprising two or more qubit registers of any length.

Qubits 14 of qubit register 12 may take various forms, depending on the desired architecture of quantum computer 10. Each qubit may comprise: a superconducting Josephson junction, a trapped ion, a trapped atom coupled to a high-finesse cavity, an atom or molecule confined within a fullerene, an ion or neutral dopant atom confined within a host lattice, a quantum dot exhibiting discrete spatial- or spin-electronic states, electron holes in semiconductor junctions entrained via an electrostatic trap, a coupled quantum-wire pair, an atomic nucleus addressable by magnetic resonance, a free electron in helium, a molecular magnet, or a metal-like carbon nanosphere, as non-limiting examples. More generally, each qubit 14 may comprise any particle or system of particles that can exist in two or more discrete quantum states that can be measured and manipulated experimentally. For instance, a qubit may be implemented in the plural processing states corresponding to different modes of light propagation through linear optical elements (e.g., mirrors, beam splitters and phase shifters), as well as in states accumulated within a Bose-Einstein condensate.

FIG. 2 is an illustration of a Bloch sphere 16, which provides a graphical description of some quantum mechanical aspects of an individual qubit 14. In this description, the north and south poles of the Bloch sphere correspond to the standard basis vectors |0

and |1

, respectively up and down spin states, for example, of an electron or other fermion. The set of points on the surface of the Bloch sphere comprise all possible pure states |ψ

of the qubit, while the interior points correspond to all possible mixed states. A mixed state of a given qubit may result from decoherence, which may occur because of undesirable coupling to external degrees of freedom.

Returning now to FIG. 1, quantum computer 10 includes a controller 18. The controller may include at least one processor 20 and associated computer memory 22. A processor 20 of controller 18 may be coupled operatively to peripheral componentry, such as network componentry, to enable the quantum computer to be operated remotely. A processor 20 of controller 18 may take the form of a central processing unit (CPU), a graphics processing unit (GPU), or the like. As such, the controller may comprise classical electronic componentry. The term ‘classical’ is applied herein to any component that can be modeled accurately as an ensemble of particles without considering the quantum state of any individual particle. Classical electronic components include integrated, microlithographed transistors, resistors, and capacitors, for example. Computer memory 22 may be configured to hold program instructions 24 that cause processor 20 to execute any function or process of the controller. In examples in which qubit register 12 is a low-temperature or cryogenic device, controller 18 may include control componentry operable at low or cryogenic temperatures—e.g., a field-programmable gate array (FPGA) operated at 77K. In such examples, the low-temperature control componentry may be coupled operatively to interface componentry operable at normal temperatures.

Controller 18 of quantum computer 10 is configured to receive a plurality of inputs 26 and to provide a plurality of outputs 28. The inputs and outputs may each comprise digital and/or analog lines. At least some of the inputs and outputs may be data lines through which data is provided to and extracted from the quantum computer. Other inputs may comprise control lines via which the operation of the quantum computer may be adjusted or otherwise controlled.

Controller 18 is operatively coupled to qubit register 12 via quantum interface 30. The quantum interface is configured to exchange data bidirectionally with the controller. The quantum interface is further configured to exchange signal corresponding to the data bidirectionally with the qubit register. Depending on the architecture of quantum computer 10, such signal may include electrical, magnetic, and/or optical signal. Via signal conveyed through the quantum interface, the controller may interrogate and otherwise influence the quantum state held in the qubit register, as defined by the collective quantum state of the array of qubits 14. To this end, the quantum interface includes at least one modulator 32 and at least one demodulator 34, each coupled operatively to one or more qubits of the qubit register. Each modulator is configured to output a signal to the qubit register based on modulation data received from the controller. Each demodulator is configured to sense a signal from the qubit register and to output data to the controller based on the signal. The data received from the demodulator may, in some examples, be an estimate of an observable to the measurement of the quantum state held in the qubit register.

In some examples, suitably configured signal from modulator 32 may interact physically with one or more qubits 14 of qubit register 12 to trigger measurement of the quantum state held in one or more qubits. Demodulator 34 may then sense a resulting signal released by the one or more qubits pursuant to the measurement, and may furnish the data corresponding to the resulting signal to the controller. Stated another way, the demodulator may be configured to output, based on the signal received, an estimate of one or more observables reflecting the quantum state of one or more qubits of the qubit register, and to furnish the estimate to controller 18. In one non-limiting example, the modulator may provide, based on data from the controller, an appropriate voltage pulse or pulse train to an electrode of one or more qubits, to initiate a measurement. In short order, the demodulator may sense photon emission from the one or more qubits and may assert a corresponding digital voltage level on a quantum-interface line into the controller. Generally speaking, any measurement of a quantum-mechanical state is defined by the operator O corresponding to the observable to be measured; the result R of the measurement is guaranteed to be one of the allowed eigenvalues of O. In quantum computer 10, R is statistically related to the qubit-register state prior to the measurement, but is not uniquely determined by the qubit-register state.

Pursuant to appropriate input from controller 18, quantum interface 30 may be configured to implement one or more quantum-logic gates to operate on the quantum state held in qubit register 12. Whereas the function of each type of logic gate of a classical computer system is described according to a corresponding truth table, the function of each type of quantum gate is described by a corresponding operator matrix. The operator matrix operates on (i.e., multiplies) the complex vector representing the qubit register state and effects a specified rotation of that vector in Hilbert space.

For example, the Hadamard gate H is defined by

$\begin{matrix} {H = {\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & {- 1} \end{bmatrix}}} & (1) \end{matrix}$

The H gate acts on a single qubit; it maps the basis state |0

to (|0

+|1

)/√{square root over (2)}, and maps |1

to (|0

−|1

)/√{square root over (2)}. Accordingly, the H gate creates a superposition of states that, when measured, have equal probability of revealing |0

or |1

.

The phase gate S is defined by

$\begin{matrix} {S = {\begin{bmatrix} 1 & 0 \\ 0 & e^{i\; \pi \text{/}2} \end{bmatrix}.}} & (2) \end{matrix}$

The S gate leaves the basis state |0

unchanged but maps |1

to e^(iπ/2)|1

. Accordingly, the probability of measuring either |0

or |1

is unchanged by this gate, but the phase of the quantum state of the qubit is shifted. This is equivalent to rotating ψ by 90 degrees along a circle of latitude on the Bloch sphere of FIG. 2.

Some quantum gates operate on two or more qubits. The SWAP gate, for example, acts on two distinct qubits and swaps their values. This gate is defined by

$\begin{matrix} {{SWAP} = {\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.}} & (3) \end{matrix}$

The foregoing list of quantum gates and associated operator matrices is non-exhaustive, but is provided for ease of illustration. Other quantum gates include Pauli-X, -Y, and -Z gates, the √{square root over (NOT)} gate, additional phase-shift gates, the √{square root over (SWAP)} gate, controlled cX, cY, and cZ gates, and the Toffoli, Fredkin, Ising, and Deutsch gates, as non-limiting examples.

Continuing in FIG. 1, suitably configured signal from modulators 32 of quantum interface 30 may interact physically with one or more qubits 14 of qubit register 12 so as to assert any desired quantum-gate operation. As noted above, the desired quantum-gate operations are specifically defined rotations of a complex vector representing the qubit register state. In order to effect a desired rotation O, one or more modulators of quantum interface 30 may apply a predetermined signal level S_(i) for a predetermined duration T_(i). In some examples, plural signal levels may be applied for plural sequenced or otherwise associated durations, as shown in FIG. 3, to assert a quantum-gate operation on one or more qubits of the qubit register. In general, each signal level S_(i) and each duration T_(i) is a control parameter adjustable by appropriate programming of controller 18.

The term ‘oracle’ is used herein to describe a predetermined sequence of elementary quantum-gate and/or measurement operations executable by quantum computer 10. An oracle may be used to transform the quantum state of qubit register 12 to effect a classical or non-elementary quantum-gate operation or to apply a density operator, for example. In some examples, an oracle may be used to enact a predefined ‘black-box’ operation ƒ(x), which may be incorporated in a complex sequence of operations. To ensure adjoint operation, an oracle mapping n input qubits |x

to m output or ancilla qubits |y=ƒ(x)) may be defined as a quantum gate O(|x

⊗|y

) operating on the (n+m) qubits. In this case, O may be configured to pass the n input qubits unchanged but combine the result of the operation ƒ(x) with the ancillary qubits via an XOR operation, such that O(|x)

⊗|y

)=|x

⊗|y ⊗ƒ(x)

. As described further below, a Gibbs-state oracle is an oracle configured to generate a Gibbs state based on a quantum state of specified qubit length.

Implicit in the description herein is that each qubit 14 of qubit register 12 may be interrogated via quantum interface 30 so as to reveal with confidence the standard basis vector |0

or |1

that characterizes the quantum state of that qubit. In some implementations, however, measurement of the quantum state of a physical qubit may be subject to error. Accordingly, any qubit 14 may be implemented as a logical qubit, which includes a grouping of physical qubits measured according to an error-correcting oracle that reveals the quantum state of the logical qubit with confidence.

Within the last several years, quantum machine learning has emerged as a significant motivation for developing quantum computers. In theory, quantum computers are naturally poised to model various real-world problems to which classical models are difficult to apply. Relative to the analogous classical model, a quantum model may be more accurate, more private, or faster to train, for example. Significantly, quantum computers may be capable of modeling probability distributions that, when represented by classical models, cannot be sampled efficiently. This ability may provide a broader or richer family of distributions than could be realized using a polynomial-sized classical model.

Many examples in which quantum machine learning may provide such advantages involve data having inherently quantum features, e.g., physical, chemical, and/or biological data. A quantum machine-learning dataset may include inter-atomic energy potentials, molecular atomization energy data, polarization data, molecular orbital eigenvalue data, protein or nucleic-acid folding data, etc. In some examples, quantum machine learning models may be suitable for simulating, evaluating, and/or designing physical quantum systems. For example, quantum machine learning models may be used to predict behavior of nanomaterials (e.g., quantum dot charge states, quantum circuitry, and the like). In some examples, quantum machine learning models may be suitable for tomography and/or partial tomography of quantum systems, e.g., approximately cloning an oracle system represented by an unknown density operator. Numerous other examples are equally envisaged.

Classical data is traditionally fed to a quantum algorithm in the form of a training set, or test set, of vectors. But rather than train on individual vectors, as one does in classical machine learning, quantum machine learning provides an opportunity to train on quantum-state vectors. More specifically, if the classical training set is thought of as a distribution over input vectors, then the analogous quantum training set would be a density operator, denoted ρ, which operates on the global quantum state of the network.

The goals in quantum machine learning vary from task to task. For unsupervised generative tasks, a common goal is to find, by experimenting with ρ, a process V such that V:|0

σ such that ∥ρ−σ∥ is small. Such a task corresponds, in quantum information language, to partial tomography or approximate cloning. Alternatively, supervised learning tasks are possible, in which the task is not to replicate the distribution but rather to replicate the conditional probability distributions over a label subspace. This approach is frequently taken in QAOA-based quantum neural networks.

In recent years, quantum Boltzmann machines have emerged as one of the most promising architectures for quantum neural networks. So that the reader can more easily understand the function of the quantum Boltzmann machine, the classical variant of the Boltzmann machine will first be described, with reference to FIGS. 4A and 4B. The skilled reader will understand that some but not all aspects of this description are relevant also to the quantum variant, which is further described hereinafter.

FIG. 4A shows aspects of a Boltzmann machine 40, in one example. Every Boltzmann machine includes one or more of visible nodes v_(i) and may also include one or more hidden nodes h_(i). The term ‘unit’ may also be used to refer to a node of a Boltzmann machine; these terms are used interchangeably herein. Only the visible nodes receive data from outside the Boltzmann machine. While FIG. 4A shows four visible and four hidden nodes, other combinations of visible and hidden nodes are also envisaged, and certainly the numbers of visible and hidden nodes need not be equal. Each visible node v_(i) and each hidden node h_(i) of classical Boltzmann machine 40 is characterized by a state variable s_(i), which may have a value of 0 or 1. The collective states of the visible and hidden nodes are expressible, therefore, as a binary vectors v and h, respectively.

Collectively, the {s_(i)} determines the global energy E of classical Boltzmann machine 40, according to

$\begin{matrix} {{{- E} = {{\sum\limits_{i < j}{w_{ij}s_{i}s_{j}}} + {\sum\limits_{i}{\theta_{i}s_{i}}}}},} & (4) \end{matrix}$

where each θ_(i) defines the bias of s_(i) on the energy, and each w_(ij) defines the weight of an additional energy of interaction, or ‘connection strength’, between nodes i and j.

During operation of Boltzmann machine 40, the equilibrium state is approached by resetting the state variable s_(i) of each of a sequence of randomly selected nodes according to a statistical rule. The rule for a classical Boltzmann machine is that the ensemble of nodes adheres to the Boltzmann distribution of statistical mechanics. In other words,

$\begin{matrix} {{{\Delta \; E_{i}} = {{- k_{B}}T\mspace{14mu} \ln \frac{p_{s_{i} = 0}}{p_{s_{i} = 1}}}},} & (5) \end{matrix}$

where p_(s) _(i) ₌₀ and p_(s) _(i) ₌₁ express the probabilities that s_(i)=0 or 1, respectively, where ΔE_(i) is the change in the energy E of the ensemble of nodes due to the transition of the single node i from the s_(i)=0 to the s_(i)=1 state, where k_(B) is a constant, and where T is an adjustable parameter akin to the ‘temperature’ of the system. The reader will note that ΔE_(i) is readily obtained from Eq 4, above. Substituting p_(s) _(i) ₌₀=1−p_(s) _(i) ₌₁, the logistic function for the classical Boltzmann machine is obtained,

$\begin{matrix} {p_{s_{i} = 1} = {\frac{1}{1 + e^{{- \Delta}\; E_{i}\text{/}T}}.}} & (6) \end{matrix}$

After a sufficient number of iterations at a predetermined T, the probability of observing any global state {s_(i)} will depend only upon the energy of that state, not on the initial state from which the process was started. At that limit, the Boltzmann machine has achieved ‘thermal equilibrium’ at temperature T. In the optional method of ‘simulated annealing’, T is gradually transitioned from higher to lower values during the approach to equilibrium, in order to increase the likelihood of descending to a global energy minimum.

In useful examples, a Boltzmann machine is trained to converge to one or more desired global states using an external training distribution over such states. During the training, biases θ_(i) and weights w_(ij) are adjusted so that the global states with the highest probabilities have the lowest energies. To illustrate the training method, let P⁺(v) be a distribution of training data over the vector of visible nodes v, and let P⁻(v) be a distribution of thermally equilibrated states of the Boltzmann machine, which have been ‘marginalized’ over the hidden nodes of the machine. An appropriate measure of the dissimilarity of the two distributions is the Kullback-Leibler (KL) divergence G, defined as

$\begin{matrix} {{G = {\sum\limits_{v}{{P^{+}(v)}\mspace{14mu} \ln \frac{P^{+}(v)}{P^{-}(v)}}}},} & (7) \end{matrix}$

where the sum is over all possible states v of v. In practice, a Boltzmann machine may be trained in two alternating phases: a ‘positive’ phase in which v is constrained to one particular binary state vector sampled from the training set (according to P+(v)), and a ‘negative’ phase in which the network is allowed to run freely. In this approach, the gradient with respect to a given weight, w_(ij) is given by

$\begin{matrix} {{\frac{\partial G}{\partial w_{ij}} = {- \frac{p_{ij}^{+} - p_{ij}^{-}}{R}}},} & (8) \end{matrix}$

where p_(ij) ⁺ is the probability that s_(i)=s_(j)=1 at thermal equilibrium of the positive phase, p_(ij) ⁻ is the probability that s_(i)=s_(j)=1 at thermal equilibrium of the negative phase, and where R is a learning-rate parameter. Finally, the biases are trained according to

$\begin{matrix} {\frac{\partial G}{\partial\theta_{i}} = {- {\frac{p_{i}^{+} - p_{i}^{-}}{R}.}}} & (9) \end{matrix}$

An important variant of the Boltzmann machine is the ‘restricted’ Boltzman machine (RBM). An example RBM 42 is represented in FIG. 4B. An RBM differs from a generic Boltzmann machine in that it has no connection (w_(ij)=0) between any pair of hidden nodes or any pair of visible nodes. Relative to the generic classical Boltzmann machine, the classical RBM is more easily trained and is applicable to a ‘deep-learning’ strategy in which the hidden nodes of a trained, upstream RBM are used to provide training data for training an adjacent downstream RBM, in a stacked, multilayer configuration.

Returning briefly to FIG. 1, quantum computer 10 may be configured to instantiate a quantum-computing analog of the classical Boltzmann machine, which is referred to herein as a quantum Boltzmann machine (QBM). The state {s_(i)} of the visible and hidden nodes of a QBM may be represented in the array of qubits 14 of qubit register 12. For example, the state of four visible nodes of a QBM may be represented in qubits 14A through 14D, and the state of four hidden nodes of the QBM may be represented in qubits 14E through 14H. In other examples, qubit register 12 may include, in addition to qubits corresponding to the visible and hidden nodes, one or more ‘ancilla’ qubits used to transiently store quantum states derived from the states of the visible and hidden nodes e.g., to implement an oracle. Naturally, any physical register of two or more qubits is divisible, as well as associable, so as to form any number of logical qubit registers visible, hidden, and ancilla registers, for example. The skilled reader will understand that a qubit register may be referred to as a ‘register’ in the description below.

Boltzmann machines are extensible to the quantum domain because they approximate the physics inherent in a quantum computer. In particular, a Boltzmann machine provides an energy for every configuration of a system and generates samples from the distribution of configurations with probabilities that depend exponentially on the energy. The same would be expected of a canonical ensemble in statistical physics. The explicit model in this case is

$\begin{matrix} {{\sigma_{v} = {{Tr}_{h}\left( \frac{e^{- H}}{Z} \right)}},} & (10) \end{matrix}$

where Tr_(h)(⋅) is the partial trace over an auxiliary subsystem known as the hidden subsystem, which serves to build correlations between nodes of the visible subsystem. Thus, the goal of generative quantum Boltzmann training is to choose σ_(v)=argmin_(H) (dist(ρ, σ_(v))) for an appropriate distance, or divergence, function. In this disclosure, the terms ‘loss function’, ‘cost function’, and ‘divergence function’ are used interchangeably.

The goal in training a QBM is to find a Hamiltonian that replicates a given input state as closely as possible. This is useful not only in generative applications, but can also be used for discriminative tasks by defining the visible unit subsystem to be composed of the tensor products of a visible subsystem and an output layer that yields the classification of the system. While generative tasks are the main focus here, it is straightforward to generalize this work to classification.

As noted above, for generative training on classical data, the natural divergence between the input and output distributions is the KL divergence. When the input is a quantum state, however, the quantum relative entropy is an appropriate measure of the divergence:

S(ρ|σ_(v))=Tr(ρ log ρ)−Tr(ρ log σ_(v)).  (11)

The skilled reader will note that Eq. 11 reduces to the KL divergence if ρ and σ_(v) are classical. Moreover, S becomes zero if and only if ρ=σ_(v).

While quantum relative entropy is generally difficult to compute, the gradient of the quantum relative entropy is readily available for QBMs with all visible units. In such cases, σ_(v)=e^(−H)/Z. Further, the relation log(e^(−H)/Z)=−H−log(Z) allows straightforward computation of the required matrix derivatives. On the other hand, no methods are known prior to this disclosure for generative training of QBMs using quantum relative entropy as a loss function when hidden units are present. This is because the partial trace in log(Tr_(h)e^(−H)/Z) prevents simplification of the logarithm term when computing the gradient.

The purpose of this disclosure is to provide practical methods for training generic QBMs that have hidden as well as visible units. Two variants are disclosed herein. The first and more efficient approach assumes a special form for the Hamiltonian, from which variational upper bounds on the quantum relative entropy are found, with an easy-to-compute derivatives. More specifically, the Hamiltonian acting on the hidden units commutes in this approach, such that the relevant gradients can be computed using a polynomial number of queries to a coherent Gibbs-state oracle. The second and more general approach uses recent techniques from quantum simulation to approximate the exact expression for the gradient of the relative entropy using Fourier-series approximations and high-order divided-difference formulas in place of the analytic derivative. Here, the exact gradient is computed using a polynomial (albeit greater) number of queries. Both methods are efficient provided that Gibbs-state preparation is efficient, which is expected to hold in most practical cases of Boltzmann training (although it is worth noting that efficient Gibbs-state preparation in general would imply QMA⊂BQP, which is unlikely to hold).

The interested reader is referred to the following list of references, which are cited below and hereby incorporated by reference herein.

-   [1] L D Landau and E M Lifshitz. Statistical physics, vol. 5. Course     of theoretical physics, 30, 1980. -   [2] Maria Kieferova and Nathan Wiebe. Tomography and generative data     modeling via quantum boltzmann training. arXiv preprint     arXiv:1612.05204, 2016. -   [3] Joran Van Apeldoorn, Andras Gilyén, Sander Gribling, and Ronald     de Wolf. Quantum sdp-solvers: Better upper and lower bounds. In     Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual     Symposium on, pages 403-414. IEEE, 2017. -   [4] David Poulin and Pawel Wocjan. Sampling from the thermal quantum     gibbs state and evaluating partition functions with a quantum     computer. Physical review letters, 103(22):220502, 2009. -   [5] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum     principal component analysis. Nature Physics, 10(9):631, 2014. -   [6] Shelby Kimmel, Cedric Yen-Yu Lin, Guang Hao Low, Maris Ozols,     and Theodore J Yoder. Hamiltonian simulation with optimal sample     complexity. npj Quantum Information, 3(1):13, 2017. -   [7] Howard E Haber. Notes on the matrix exponential and logarithm.     2018. -   [8] Nicholas J Higham. Functions of matrices: theory and     computation, volume 104. Siam, 2008. -   [9] Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp.     Quantum amplitude amplification and estimation. Contemporary     Mathematics, 305:53-74, 2002.

Returning now to the drawings, FIG. 5 illustrates an example method 50 to train a QBM having one or more visible nodes and one or more hidden nodes. The method uses quantum relative entropy as a cost function and includes estimation of the gradient of the quantum relative entropy.

At 52 of method 50, a QBM having visible and hidden nodes is instantiated in a quantum computer. Here each visible and each hidden node of the QBM is associated with a different corresponding qubit of a plurality of qubits of the quantum computer. As described above, the state of each of the plurality of qubits contributes to the global energy of the QBM according to a set of weighting factors.

At 54 initial values of the weighting factors—e.g., biases θ_(i) and weights w_(ij)—are provided to the QBM. As noted above, the initial values of the weighting factors may be incorporated into a distribution σ_(v) over the one or more visible nodes of the QBM, for example.

At 56 a predetermined distribution of training data is provided to the visible nodes of the QBM. Generally speaking, training data may be provided so as to span the entire visible subsystem of the QBM, or any subset thereof. For generative applications, for example, the training distribution may cover all of the visible nodes, whereas for some classification tasks, it may be sufficient to compute a training loss function (such as a classification error rate) on a subset of visible nodes designated as the ‘output units’ or ‘output qubits’. Accordingly, the plurality of qubits of the qubit register may include one or more designated output qubits corresponding to one, some, or all of the visible nodes of the QBM, and a distribution of training data is provided over the one or more output qubits.

As noted above, the distribution of training data may take the form of a density operator ρ, which represents the quantum state of the visible subsystem as a statistical distribution or mixture of pure quantum states. Accordingly, the density operator may represent a statistically-weighted collection of possible observations of a quantum system, analogous to a probability distribution over classical state vectors. In some examples, a density operator ρ may represent superpositions of different basis states and/or entangled states. Superposition states in quantum data may represent uncertainty and/or ambiguity in the data. Entangled states may represent correlations between states. Accordingly, a density operator ρ may be used to more precisely describe systems in which uncertainty and/or non-trivial correlations occur, relative to any competing classical distribution. In some examples, classical distributions of training data may be converted to an appropriate density operator for use in the methods herein.

At 58 the QBM is driven to thermal equilibrium by repeated resetting of the state of each node and application of a logistic measurement function. At 60 the gradient of the quantum relative entropy between the one or more output qubits and the distribution of training data is estimated with respect to the weighting factors, based on the thermally equilibrated qubit state held in the quantum computer. At 62 it is determined whether a minimum of the quantum relative entropy has been reached with respect to the set of weighting factors. If a minimum has not been reached, then the method advances to 64, where the weighting factors are adjusted (i.e., trained) based on the estimated gradient of the quantum relative entropy, using the quantum relative entropy as a cost function. From here the equilibrium state of the QBM is again approached, now using the adjusted weighting factors. If a minimum in the quantum relative entropy is reached at 62, then the training procedure concludes with the currently adjusted values of the weighting factors accepted as trained values. The trained QBM may now be provided, at 66, subsequent non-training distributions, for generative, discriminative, or classification tasks. In other examples, execution of method 50 may loop back to 56 where an additional training distribution is offered and processed.

Whereas computing the gradient of the average log-likelihood is a straightforward task when training a classical Boltzmann machine, finding the gradient of the quantum relative entropy is much harder. The reason is that, in general, [∂_(θ)H(θ), H(θ)]≠0. This means that certain rules for finding the derivative no longer hold. One important example, used repeatedly herein, is Duhamel's formula:

∂_(θ) e ^(H(θ))=∫₀ ¹ dse ^(H(θ)s)∂θ_(H)(θ)e ^(H(θ)(1−s)).  (12)

This formula is proven by expanding the operator exponential in a Trotter-Suzuki expansion with r time-slices, differentiating the result and then taking the limit as r→∞. However, the relative complexity of this expression compared to what would be expected from the product rule serves as an important reminder that computing the gradient is not a trivial exercise. A similar formula for the logarithm is provided in the Appendix herein.

Because this disclosure involves functions of matrices, the notion of monotonicity is relevant, in addition to a formal definition of a QBM. For some approximations to hold, it is also necessary to define the notion of concavity, in order to use Jensen's inequality.

Definition 1. Operator Monoticity

A function ƒ is operator monotone with respect to the semidefinite order if 0

A

B, for two symmetric positive definite operators, implies ƒ(A)

ƒ(B). A function is operator concave w.r.t. the semidefinite order if cƒ(A)+(1−c)ƒ(B)

ƒ(cA+(1−c)B) for all positive definite A, B, and c∈[0, 1].

Definition 2

A quantum Boltzmann machine (QBM) is defined as a quantum mechanical system that acts on a tensor product of Hilbert spaces

_(v)

_(h)∈

² ^(n) that correspond to the visible and hidden subsystems of the QBM. The QBM has a Hamiltonian of the form H∈

² ^(n) ^(×2) ^(n) such that ∥H−diag(H)∥>0. The QBM takes these parameters and then outputs a state of the form Tr_(h)

$\left( \frac{e^{- H}}{{Tr}\left( e^{- H} \right)} \right),$

where Tr_(h)(⋅) refers to the partial trace over the hidden subspace of the model.

As described above, the quantum relative entropy cost function for a QBM with hidden units is given by

_(ρ)(H)=S(ρ|Tr_(h)[e ^(−H)/Tr[e ^(−H)]]),  (13)

where S(ρ|σ):=Tr [ρ log ρ]−Tr [ρ log σ] is the quantum relative entropy. In the following, the reduced density matrix on the visible units of the QBM is denoted σ=Tr_(h) [e^(−H)/Tr [e^(−H)]]. A regularization term may be included in the expression above to penalize unnecessary quantum correlations in the model. For simplicity, however, such regularization has been omitted herein.

In the case of an all-visible QBM (which corresponds to dim(

_(h))=1) a closed form expression for the gradient of the quantum relative entropy is known:

$\begin{matrix} {{\frac{\partial{_{\rho}(H)}}{\partial\theta} = {- {{Tr}\left\lbrack {\frac{\partial}{\partial\theta}\rho \mspace{14mu} \log \mspace{14mu} \sigma} \right\rbrack}}},} & (14) \end{matrix}$

which can be simplified using log(exp(−H))=−H and Duhamels formula to obtain the following equation for the gradient. Denoting ∂_(θ):=∂/∂θ,

Tr[ρ∂_(θ) H]−Tr[e ^(−H)∂_(θ) H]/Tr[e ^(−H)].  (15)

However, the above gradient formula is not generally valid, and indeed does not hold, if hidden units are included. Allowing for hidden units, it is necessary to addition-ally trace out the hidden subsystem, which results in the majorised distribution σ_(v):=Tr_(h) [e^(−H)/Tr [e^(−H)]]. In general, the adapted cost function for the QBM with hidden units takes the form

_(ρ)(H)=S(ρ∥Tr_(h)[e ^(−H)]/Tr[^(−H)].  (16)

Note that H depends on variables that will be altered during the training process, while ρ is the target density matrix on which the training will have no influence. Therefore, in estimating the gradient of the above, it is possible to omit the Tr [ρ log ρ] and obtain:

$\begin{matrix} {{\frac{\partial{_{\rho}(H)}}{\partial\theta} = {- {{Tr}\left\lbrack {\frac{\partial}{\partial\theta}\rho \mspace{14mu} \log \mspace{14mu} \sigma_{v}} \right\rbrack}}},} & (17) \end{matrix}$

where with σ_(v):=Tr_(h) [e^(−H)]/Tr [e^(−H)] is denoted the reduced density matrix, which is marginalised over the hidden units.

Two different training variants will now be discussed. While the first is less general, it gives an easily implementable algorithm and strong bounds using based on optimizing a variational bound. The second approach is, on the other hand, applicable to any problem instance and represents a general-purpose gradient-optimisation algorithm for relative-entropy training. The no-free-lunch theorem suggests that no (good) bounds can be obtained without assumptions on the problem instance, and indeed, the general algorithm exhibits, potentially, exponentially worse complexity. However, it may still be possible to make use of the second variant for specific applications, particularly because it gives a generally applicable algorithm for training QBMs on a quantum device. This appears to be the first known result of this kind.

Approach 1: Variational Training for Restricted Hamiltonians

The first approach is based on a variational bound of the objective function, i.e., the quantum relative entropy. In order to operationalize this approach, certain assumptions on the Hamiltonian are relied upon. These assumptions are important, as several instances of scalar calculus fail on transitioning to matrix functional analysis, and, for gradient-based approaches in particular, the assumptions are required in order to obtain a feasible analytical solution.

The Hamiltonian for a QBM may be expressed as

H=H _(v) +H _(h) +H _(int),  (18)

which represents the energy operator acting on the visible units, the hidden units and a third interaction operator that creates correlations between the two. It is further assumed, for simplicity, that there are two sets of operators {v_(k)} and {h_(k)} composed of D=W_(v)+W_(h)+W_(int) terms with

$\begin{matrix} {{{H_{v} = {\sum\limits_{k = 1}^{W_{v}}\; {\theta_{k}{v_{k} \otimes I}}}},{H_{h} = {\sum\limits_{k = {W_{v} + 1}}^{W_{v} + W_{h}}\; {\theta_{k}{I \otimes h_{k}}}}}}{{H_{int} = {\sum\limits_{k = {W_{v} + W_{h} + 1}}^{W_{v} + W_{h} + W_{int}}\; {\theta_{k}{v_{k} \otimes h_{k}}}}},{\left\lbrack {h_{k},h_{j}} \right\rbrack = {0\mspace{14mu} {\forall j}}},{k.}}} & (19) \end{matrix}$

While this implies that the Hamiltonian can in general be expressed as

$\begin{matrix} {{H = {\sum\limits_{k = 1}^{D}\; {\theta_{k}{v_{k} \otimes h_{k}}}}},} & (20) \end{matrix}$

it is useful to break up the Hamiltonian into the above form, to emphasize the qualitative difference between the types of terms that can appear in this model.

The intention of the form of the Hamiltonian in Eq. 19 is to force the non-commuting terms to act only on the visible units of the model. In contrast, only commuting Hamiltonian terms act on the hidden register. Since the hidden units commute, the eigenvalues and eigenvectors for the Hamiltonian can be expressed as text use

H|v _(h)

⊗|h

=λ _(v) _(h) _(,h) |v _(h)

⊗|h

,  (21)

where both the conditional eigenvectors and eigenvalues for the visible subsystem are functions of the eigenvector |h

obtained in the hidden register. This allows the hidden units to select between eigenbases to interpret the input data while also penalizing portions of the accessible Hilbert space that are not supported by the training data. However, since the hidden units commute, they cannot be used to construct a non-diagonal eigenbasis. This division of labor between the visible and hidden layers not only helps build intuition about the model but also opens up the possibility for more efficient training algorithms to exploit this fact.

For the following result, a variational bound is used in order to train the QBM weights for a Hamiltonian H of the form given in Eq. 20. The variational bound is expressible compactly in terms of a thermal expectation against a fictitious thermal probability distribution, as defined below.

Definition 3

Let {tilde over (H)}_(h)=Σ_(k)θ_(k)Tr [ρv_(k)] h_(k) be the Hamiltonian acting conditioned on the visible subspace only on the hidden subsystem of the Hamiltonian H:=Σ_(k)θ_(k)v_(k)⊗h_(k). Then the expectation value over the marginal distribution over the hidden variables h is defined as:

$\begin{matrix} {{_{h}( \cdot )} = {\sum\limits_{h}{\frac{( \cdot )r^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}{\Sigma_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}.}}} & (22) \end{matrix}$

This definition is now used to state the expression for the gradient of the variational bound on the quantum relative entropy.

Definition 4

Assume that the Hamiltonian of the QBM takes the form H, where θ_(k) are the parameters that determine the interaction strength and v_(k), h_(k) are unitary operators. Furthermore, let h_(k)|h

=E_(h,k)|h

be the eigenvalues of the hidden subsystem, and let

_(h)(⋅) be as given by Def. 3, i.e., the expectation value over the effective Boltzmann distribution of the visible layer with {tilde over (H)}_(h):=Σ_(h,k)θ_(k)v_(k). Finally, let the variational upper bound of the objective function be given by

$\begin{matrix} {{{{\overset{\sim}{S}\left( {\rho H} \right)}\mspace{14mu} \text{:=}\mspace{14mu} {{Tr}\left\lbrack {\rho \mspace{14mu} \log \mspace{14mu} \rho} \right\rbrack}} + {{Tr}\left\lbrack {{\rho {\sum\limits_{k}{_{h}\left\lbrack {E_{h,k}\theta_{k}v_{k}} \right\rbrack}}} + {_{h}\left\lbrack {\log \mspace{14mu} \alpha_{h}} \right\rbrack}} \right\rbrack} - {\log \mspace{14mu} Z}},\mspace{76mu} {where}} & (23) \\ {\mspace{76mu} {\alpha_{h} = {\frac{e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}{\Sigma_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}.}}} & (24) \end{matrix}$

This is the corresponding Gibbs distribution for the visible units.

Lemma 1.

Under the assumptions of Def. 4, {tilde over (S)} is a variational upper bound on the quantum relative entropy, meaning that {tilde over (S)}(ρ|H)≥S(ρ|e^(−H)/Z). Furthermore, the derivatives of this upper bound with respect to the parameters of the Boltzmann machine are

$\begin{matrix} {\frac{\partial{\overset{\sim}{S}\left( {\rho H} \right)}}{\partial_{\theta_{p}}} = {{_{h}\left\lbrack {{Tr}\left( {\rho \; E_{h,p}v_{p}} \right\rbrack} \right\rbrack} - {{{Tr}\left\lbrack {\frac{\partial H}{\partial\theta_{p}}\frac{e^{- H}}{Z}} \right\rbrack}.}}} & (25) \end{matrix}$

Proof.

First derived is the gradient of the normalization term (Z) in the relative entropy, which can be trivially evaluated using Duhamels formula to obtain

$\begin{matrix} {{\frac{\partial}{\partial\theta_{p}}\log \mspace{14mu} {{Tr}\left\lbrack e^{- H} \right\rbrack}} = {{- {{Tr}\left\lbrack {\frac{\partial H}{\partial\theta_{p}}\frac{e^{- H}}{Z}} \right\rbrack}} = {- {{{Tr}\left\lbrack {\sigma {\partial_{\theta_{p}}H}} \right\rbrack}.}}}} & (26) \end{matrix}$

Note that this term is evaluated by first preparing the Gibbs state σ_(Gibbs):=e^(−H)/Z and then evaluating the expectation value of the operator ∂_(θ) _(j) H w.r.t. the Gibbs state, using amplitude estimation for the Hadamard test. If T_(Gibbs) is the query complexity for the Gibbs state preparation, then the query complexity of the whole algorithm including the phase estimation step is then given by O(T_(Gibbs)/{tilde over (∈)}) for an {tilde over (∈)}-accurate estimate of phase estimation.

Proceeding now with the gradient evaluations for the model, the reader will recall that the Hamiltonian H is assumed to take the form

$\begin{matrix} {{H\mspace{14mu} \text{:=}\mspace{14mu} {\sum\limits_{k}{\theta_{k}{v_{k} \otimes h_{k}}}}},} & (27) \end{matrix}$

where v_(k) and h_(k) are operators acting on the visible and hidden units respectively, and that h_(k)=d_(k) is assumed to be diagonal in the chosen basis. Under the assumption that [h_(i), h_(j)]=0, ∇i, j, c.f., the assumptions in Eq. 19, there exists a basis {|h

} for the hidden subspace such that h_(k)|h

=E_(h,k)|h

. With these assumptions, the logarithm may be reformulated as

$\begin{matrix} {{\log \mspace{14mu} {{Tr}_{h}\left\lbrack e^{- H} \right\rbrack}} = {\log\left( {\sum\limits_{v,v^{\prime},h}{{\langle{v,{h{e^{{- \Sigma_{k}}\theta_{k}{v_{k} \otimes h_{k}}}}v^{\prime}},h}\rangle}{v\rangle}{\langle v^{\prime}}}} \right)}} & (28) \\ {= {\log\left( {\sum\limits_{v,v^{\prime},h}{{\langle{v{e^{{- \Sigma_{k}}E_{h,k}\theta_{k}v_{k}}}v^{\prime}}\rangle}{v\rangle}{\langle v^{\prime}}}} \right)}} & (29) \\ {{= {\log\left( {\sum\limits_{h}e^{{- \Sigma_{k}}E_{h,k}\theta_{k}v_{k}}} \right)}},} & (30) \end{matrix}$

where it is important to note that v_(k) are operators, and hence, the matrix representation of these are used in the last step. In order to further simplify this expression, it is noted that each term in the sum is a positive semi-definite operator. In particular, it will be noted that the matrix logarithm is operator concave and operator monotone, and hence by Jensen's inequality, for any sequence of non-negative number {α_(i)}:Σ_(i)α_(i)=1,

$\begin{matrix} {{\log\left( \frac{\sum\limits_{i = 1}^{N}\; {\alpha_{i}U_{i}}}{\Sigma_{j}\alpha_{j}} \right)} \geq {\frac{\sum\limits_{i = 1}^{N}\; {\alpha_{i}\mspace{14mu} \log \mspace{14mu} \left( U_{i} \right)}}{\Sigma_{j}\alpha_{j}}.}} & (31) \end{matrix}$

Further, since Tr [ρ log ρ]−Tr [ρ log σ_(v)] is being optimized, for arbitrary choice of {α_(i)}_(i) under the above constraints,

$\begin{matrix} {{{Tr}\left\lbrack {\rho \mspace{14mu} {\log \left( {\sum\limits_{h = 1}^{N}\; e^{{- \Sigma_{k}}E_{h,k}\theta_{k}v_{k}}} \right)}} \right\rbrack} = {{{Tr}\left\lbrack {\rho \mspace{14mu} {\log \left( {\sum\limits_{h = 1}^{N}\; {\alpha_{h}\frac{e^{{- \Sigma_{k}}E_{h,k}\theta_{k}v_{k}}\text{/}\alpha_{h}}{\Sigma_{h^{\prime}}\alpha_{h^{\prime}}}}} \right)}} \right\rbrack} \geq {- {{{Tr}\left\lbrack {\rho \frac{{\Sigma_{h}\alpha_{h}\Sigma_{k}E_{h,k}\theta_{k}v_{k}} + {\Sigma_{h}\alpha_{h}\mspace{14mu} \log \mspace{14mu} \alpha_{h}}}{\Sigma_{h^{\prime}}\alpha_{h^{\prime}}}} \right\rbrack}.}}}} & (32) \end{matrix}$

Hence, the variational bound on the objective function for any {α_(i)}_(i) is

$\begin{matrix} {{{_{\rho}(H)} = {{{{Tr}\left\lbrack {\rho \mspace{14mu} \log \mspace{14mu} \rho} \right\rbrack} - {{Tr}\left\lbrack {\rho \mspace{14mu} \log \mspace{14mu} \sigma_{v}} \right\rbrack}} \leq {{{Tr}\left\lbrack {\rho \mspace{14mu} \log \mspace{14mu} \rho} \right\rbrack} + {{Tr}\left\lbrack {\rho \frac{{\Sigma_{h}\alpha_{h}\Sigma_{k}E_{h,k}\theta_{k}v_{k}} + {\Sigma_{h}\alpha_{h}\mspace{14mu} \log \mspace{14mu} \alpha_{h}}}{\Sigma_{h^{\prime}}\alpha_{h^{\prime}}}} \right\rbrack} + {\log \mspace{14mu} Z\mspace{14mu} \text{=:}\mspace{14mu} \overset{\sim}{S}}}}},} & (33) \end{matrix}$

which yields

$\begin{matrix} {\frac{\partial\overset{\sim}{S}}{\partial_{\theta_{p}}} = {{- {{Tr}\left\lbrack {\frac{\partial H}{\partial\theta_{p}}\frac{e^{- H}}{Z}} \right\rbrack}} + {{Tr}\left\lbrack {\frac{\partial}{\partial\theta_{p}}\rho {\sum\limits_{h}{\alpha_{h}{\sum\limits_{k}{E_{h,k}\theta_{k}v_{k}}}}}} \right\rbrack} + {\frac{\partial}{\partial\theta_{p}}{\sum\limits_{h}{\alpha_{h}\mspace{14mu} \log \mspace{14mu} \alpha_{h}}}}}} & (34) \\ {= {{- {{Tr}\left\lbrack {\frac{\partial H}{\partial\theta_{p}}\frac{e^{- H}}{Z}} \right\rbrack}} + {\frac{\partial}{\partial\theta_{p}}\left( {{\sum\limits_{h}{\alpha_{h}{{Tr}\left\lbrack {\rho {\sum\limits_{k}{E_{h,k}\theta_{k}v_{k}}}} \right\rbrack}}} + {\sum\limits_{h}{\alpha_{h}\mspace{14mu} \log \mspace{14mu} \alpha_{h}}}} \right)}}} & (35) \end{matrix}$

where the first term results from the partition sum. The former term can be seen as a new effective Hamiltonian, while the latter term is the entropy. The latter term hence resembles the free energy F(h)=E(h)−TS(h), where E(h) is the mean energy of the effective system with energies E(h):=Tr [ρΣ_(k)E_(h,k)θ_(k)v_(k)], T the temperature and S(h) the Shannon entropy of the α_(h) distribution. Now α_(h) terms are chosen to minimize this variational upper bound.

It is well-established in statistical physics, see for example [1], that the distribution that maximizes the free energy is the Boltzmann (or Gibbs) distribution, i.e.,

$\begin{matrix} {{\alpha_{h} = \frac{e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}{\Sigma_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}},} & (36) \end{matrix}$

where {tilde over (H)}_(h):=Σ_(k)E_(h,k)θ_(k)v_(k) is a new effective Hamiltonian on the visible units, and the {α_(i)} are given by the corresponding Gibbs distribution for the visible units.

Therefore, the gradients can be taken with respect to this distribution and the bound above, where Tr [ρ{tilde over (H)}_(h)] is the mean energy of the effective visible system w.r.t. the data-distribution. For the derivative of the energy term,

$\begin{matrix} {\mspace{76mu} {{\frac{\partial}{\partial\theta_{p}}{\sum\limits_{h}{\alpha_{h}{{Tr}\left\lbrack {\rho {\sum\limits_{k}{E_{h,k}\theta_{k}v_{k}}}} \right\rbrack}}}} =}} & (37) \\ {= {\sum\limits_{h}\left( {{{\alpha_{h}\left( {{_{h^{\prime}}\left\lbrack {{Tr}\left\lbrack {\rho \; E_{h^{\prime},p}v_{p}} \right\rbrack} \right\rbrack} - {{Tr}\left\lbrack {\rho \; E_{h,p}v_{p}} \right\rbrack}} \right)}{{Tr}\left\lbrack {\rho \; {\overset{\sim}{H}}_{h}} \right\rbrack}} + {\alpha_{h}{{Tr}\left\lbrack {\rho \; E_{h,p}v_{p}} \right\rbrack}}} \right)}} & (38) \\ {{= {_{h}\left\lbrack {{\left( {{_{h^{\prime}}\left\lbrack {{Tr}\left\lbrack {\rho \; E_{h^{\prime},p}v_{p}} \right\rbrack} \right\rbrack} - {{Tr}\left\lbrack {\rho \; E_{h,p}v_{p}} \right\rbrack}} \right){{Tr}\left\lbrack {\rho \; {\overset{\sim}{H}}_{h}} \right\rbrack}} + {{Tr}\left\lbrack {\rho \; E_{h,p}v_{p}} \right\rbrack}} \right\rbrack}},} & (39) \end{matrix}$

while the entropy term yields

$\begin{matrix} {{\frac{\partial}{\partial\theta_{p}}{\sum\limits_{h}{\alpha_{h}\mspace{14mu} \log \mspace{14mu} \alpha_{h}}}} = {{\sum\limits_{h}{\alpha_{h}\left( {{\left\lbrack {{{Tr}\left\lbrack {\rho \; E_{h,p}v_{p}} \right\rbrack} - {_{h^{\prime}}\left\lbrack {{Tr}\left\lbrack {\rho \; E_{h^{\prime},p}v_{p}} \right\rbrack} \right\rbrack}} \right\rbrack {{Tr}\left\lbrack {\rho \; {\overset{\sim}{H}}_{h}} \right\rbrack}} - {{Tr}\left\lbrack {\rho \; E_{h,p}v_{p}} \right\rbrack}} \right)}} + {\sum\limits_{h}{{\alpha_{h}\left( {{{Tr}\left\lbrack {\rho \; E_{h,p}} \right\rbrack} - {_{h^{\prime}}\left\lbrack {{Tr}\left\lbrack {\rho \; E_{h^{\prime},p}v_{p}} \right\rbrack} \right\rbrack}} \right)}\log \mspace{14mu} {{Tr}\left\lbrack e^{- {\overset{\sim}{H}}_{h}} \right\rbrack}}} + {{_{h^{\prime}}\left\lbrack {{Tr}\left\lbrack {\rho \; E_{h^{\prime},p}v_{p}} \right\rbrack} \right\rbrack}.}}} & (40) \end{matrix}$

This can be further simplified to

$\begin{matrix} {\sum\limits_{h}{{\alpha_{h}\left( {{{Tr}\left\lbrack {\rho \; E_{h,p}v_{p}} \right\rbrack} - {_{h^{\prime}}\left\lbrack {{Tr}\left\lbrack {\rho \; E_{h^{\prime},p}v_{p}} \right\rbrack} \right\rbrack}} \right)}{{Tr}\left\lbrack {\rho \; {\overset{\sim}{H}}_{h}} \right\rbrack}}} & (41) \\ {= {{_{h}\left\lbrack {\left( {{{Tr}\left\lbrack {\rho \; E_{h^{\prime},p}v_{p}} \right\rbrack} - {_{h^{\prime}}\left\lbrack {{Tr}\left\lbrack {\rho \; E_{h^{\prime},p}v_{p}} \right\rbrack} \right\rbrack}} \right){{Tr}\left\lbrack {\rho \; {\overset{\sim}{H}}_{h}} \right\rbrack}} \right\rbrack}.}} & (42) \end{matrix}$

The resulting gradient for the variational bound for the visible terms is hence given by

$\begin{matrix} {\frac{\partial\overset{\sim}{S}}{\partial_{\theta_{p}}} = {{_{h}\left\lbrack {{Tr}\left\lbrack {\rho \; E_{h,p}v_{p}} \right\rbrack} \right\rbrack} - {{{Tr}\left\lbrack {\frac{\partial H}{\partial\theta_{p}}\frac{e^{- H}}{Z}} \right\rbrack}.}}} & (43) \end{matrix}$

Notably, if one considers no interactions between the visible and hidden units, then indeed the gradient above reduces to the case of the visible Boltzmann machine, which was treated in [2], resulting in the gradient

$\begin{matrix} {{{{Tr}\left\lbrack {\rho {\partial_{\theta_{p}}H}} \right\rbrack} - {{Tr}\left\lbrack {\frac{e^{- H}}{Z}{\partial_{\theta_{p}}H}} \right\rbrack}},} & (44) \end{matrix}$

under applicable assumptions on the form of H, ∂_(θ) _(p) H=v_(p).

Operationalizing the Gradient-Based Training.

From Lemma 1, it is known that the derivative of the relative entropy w.r.t. any parameter θ_(p) can be stated as

$\begin{matrix} {\frac{\partial\overset{\sim}{S}}{\partial_{\theta_{p}}} = {{{_{h}\left\lbrack E_{h,p} \right\rbrack}{{Tr}\left\lbrack {\rho \; v_{p}} \right\rbrack}} - {{{Tr}\left\lbrack {\frac{\partial H}{\partial\theta_{p}}\frac{e^{- H}}{Z}} \right\rbrack}.}}} & (45) \end{matrix}$

Since evaluating the latter part is, as mentioned above, straightforward, an algorithm is now given for evaluating the first part.

In view of the foregoing analysis, it is possible to evaluate each term Tr [ρv_(k)] individually for all k∈[D], i.e., all D dimensions of the gradient via the Hadamard test for v_(k), assuming v_(k) is unitary. More generally, for non-unitary v_(k) one could evaluate this term using a linear combination of unitary operations. Therefore, the remaining task is to evaluate the terms

_(h)[E_(h,p)] in Eq. 45, which reduces to sampling according to the distribution {α_(h)}. For this it is necessary to be able to create a Gibbs distribution for the effective Hamiltonian {tilde over (H)}_(h)=Σ_(k)θ_(k)Tr [ρv_(k)] h_(k) that contains only D terms and can hence be evaluated efficiently as long as D is small, which can generally be assumed. In order to sample according to the distribution {α_(h)}, the factors θ_(k)Tr [ρv_(k)] in the sum over k are first evaluated via the Hadamard test, and then used in order to implement the Gibbs distribution exp (−{tilde over (H)}_(h))/{tilde over (Z)} for the Hamiltonian

$\begin{matrix} {{\overset{\sim}{H}}_{h} = {\sum\limits_{k}{\theta_{k}{{Tr}\left\lbrack {\rho \; v_{k}} \right\rbrack}{h_{k}.}}}} & (46) \end{matrix}$

To this end, the results of [3] are adapted in order to prepare the corresponding Gibbs state, although alternative methods may also be used, e.g., [4].

Theorem 1.

Gibbs state preparation [3]. Suppose that I

H and we are given K∈

₊ such that ∥H∥≤2K, and let H∈

^(N×N) be a d-sparse Hamiltonian, and we know a lower bound z≤Z=Tr [e^(−H)]. If ϵ∈(0, ⅓), then we can prepare a purified Gibbs state |γ

_(AB) such that

$\begin{matrix} {{{{{Tr}_{B}\left\lbrack {{\gamma\rangle}{\langle\gamma }_{AB}} \right\rbrack} - \frac{e^{- H}}{Z}}} \leq \epsilon} & (47) \end{matrix}$

using

$\begin{matrix} {\overset{\sim}{}\left( {\sqrt{\frac{N}{z}}{Kd}\mspace{14mu} \log \left( \frac{K}{\epsilon} \right){\log \left( \frac{1}{\epsilon} \right)}} \right)} & (48) \end{matrix}$

queries, and

$\begin{matrix} {\overset{\sim}{}\left( {\sqrt{\frac{N}{z}}{Kd}\mspace{14mu} {\log \left( \frac{K}{\epsilon} \right)}{{\log \left( \frac{1}{\epsilon} \right)}\left\lbrack {{\log (N)} + {\log^{5\text{/}2}\left( \frac{K}{\epsilon} \right)}} \right\rbrack}} \right)} & (49) \end{matrix}$

gates.

Note that by using the above algorithm with {tilde over (H)}/2, the preparation of the purified Gibbs state will result in the state

$\begin{matrix} {{{{{{{{\psi}\rangle}_{Gibbs}\mspace{14mu} \text{:=}\mspace{14mu} {\sum\limits_{h}\frac{e^{{- E_{h}}\text{/}2}}{\sqrt{Z}}}}h}\rangle}_{A}\varphi_{h}}\rangle}_{B},} & (50) \end{matrix}$

where |ϕ_(j)

_(B) are mutually orthogonal trash states, which can typically be chosen to be |h

, i.e., a copy of the first register, which is irrelevant for our computation, and |h

_(A) are the eigenstates of {tilde over (H)}. Tracing out the second register will hence result in the corresponding Gibbs state

$\begin{matrix} {{\sigma_{h}\mspace{14mu} \text{:=}\mspace{14mu} {\sum\limits_{h}{\frac{e^{- E_{h}}}{Z}{h\rangle}{\langle h}_{A}}}},} & (51) \end{matrix}$

and hence the Hadamard test may now be used with input h_(k) and σ_(h), i.e., the operators on the hidden units and the Gibbs state, and estimate the expectation value

_(h) [E_(h,k)]. Such a method is provided below.

Theorem 2.

Under the assumptions of Lemma 1 and Theorem 1,

∈

^(D) can be computed for such that for any ϵ∈(0, max{⅓, 4 max_(h,p)|E_(h,p)|}) such that

$\begin{matrix} {{{{ - {\nabla\overset{\sim}{}}}}_{\max} \leq \epsilon},{with}} & (52) \\ {{\overset{\sim}{}\left( {\sqrt{\xi}\frac{D{\theta }_{1}\mspace{14mu} {dn}^{2}}{\epsilon}} \right)},} & (53) \end{matrix}$

queries to the oracle O_(H) and O_(ρ) with probability at least ⅔, where ξ:=max[N/z, N_(h)/z_(h)], N=2^(n), N_(h)=2^(n) ^(h) , and z, z_(h) are known lower bounds on the partition functions for the Gibbs state of H and {tilde over (H)}_(h) respectively.

Proof.

Conceptually, the following steps are performed, starting with Gibbs state preparation followed by a Hadamard test coupled with amplitude estimation to obtain estimates of the probability of a 0 measurement. The proof follows straight from the following algorithm.

-   -   1. One starts by preparing a Hadamard test state, i.e., let

${{{\psi\rangle}_{Gibbs}:} = {\Sigma_{h}\frac{e^{{- E_{h}}/2}}{\sqrt{Z}}{h\rangle}_{A}{\varphi_{h}\rangle}_{B}}},$

be the purified Gibbs state. Then an ancilla qubit is prepared in the |+

-state, and a controlled-h_(k) conditioned on the ancilla register is applied, followed by a Hadamard gate, i.e.,

½(|0

(|ψ

_(Gibbs)+(h _(k) ⊗I)|ψ

_(Gibbs))+|1

(|ψ

_(Gibbs)−(h _(k) ⊗I)|ψ

_(Gibbs)))  (54)

-   -   2. Next an amplitude estimation is performed on the |0         state. Let reflector Z:=−2|0         0|+I, where I is the identity which is just the Pauli Z matrix         up to a global phase, and let G:=(2|ϕ         ϕ|−I) (Z⊗I), for |ϕ         being the state after the Hadamard test, prior to the         measurement. The operator G has then the eigenvalue         μ_(±)=±e^(±i2θ) where 2θ=arcsin Pr(0), and Pr(0) is the         probability to measure the ancilla qubit in the |0         state. Let now T_(Gibbs) be the query complexity for preparing         the purified Gibbs state, given in Eq. 48. It is now possible to         perform phase estimation with precision e for the operator G         requiring O(T_(Gibbs)/{tilde over (ϵ)}) queries to the oracle of         H.     -   3. Note that the above will return an {tilde over (ϵ)}-estimate         of the probability of the Hadamard test to return 0, if we         measure now the phase estimation register. For the outcome, note         that

$\begin{matrix} \begin{matrix} {{\Pr (0)} = {\frac{1}{2}\left( {1 + {{Re}{\langle{\psi {_{Gibbs}{\left( {h_{k} \otimes I} \right){\psi\rangle}_{Gibbs}}}}}}} \right)}} \\ {= {\frac{1}{2}\left( {1 + {\sum\limits_{h}\frac{e^{- E_{h}}E_{h,k}}{Z}}} \right)}} \\ {{= {\frac{1}{2}\left( {1 + {_{h}\left\lbrack E_{h,k} \right\rbrack}} \right)}},} \end{matrix} & (55) \end{matrix}$

-   -   from which one can easily infer the estimate of         _(h) [E_(h,k)] up to precision e for all the k terms.

From the above it is seen that the runtime constitutes the query complexity of preparing the Gibbs state

$\begin{matrix} {{T_{Gibbs}^{V} = {\overset{\sim}{}\left( {\sqrt{\frac{2^{n}}{Z}}\frac{{{H(\theta)}}d}{\epsilon}{\log \left( \frac{{H(\theta)}}{\overset{\sim}{\epsilon}} \right)}{\log \left( \frac{1}{\overset{\sim}{\epsilon}} \right)}} \right)}},} & (56) \end{matrix}$

where 2^(n) is the dimension of the Hamiltonian, as given in Theorem 2 and combining it with the query complexity of the amplitude estimation procedure, i.e., 1/ϵ. However, in order to obtain a final error of ϵ, the error in the Gibbs state preparation must also be accounted for. For this, note that terms of the form Tr_(AB) [

ψ|_(Gibbs) (h_(k)⊗I)|ψ

_(Gibbs) ^(V)]=Tr_(AB) [(h_(k)⊗I)|ψ

_(Gibbs) ^(V)

ψ|_(Gibbs) ^(V)] may be estimated. The error w.r.t. the true Gibbs state σ_(Gibbs) is therefore be estimated as

$\begin{matrix} {{{{Tr}_{AB}\left\lbrack {\left( {h_{k} \otimes I} \right){\psi\rangle}_{Gibbs}^{V}{\langle\psi }_{Gibbs}^{V}} \right\rbrack} - {{Tr}_{A}\left\lbrack {h_{k}\sigma_{Gibbs}} \right\rbrack}} = {{{Tr}_{A}\left\lbrack {{h_{k}{{Tr}_{B\;}\left\lbrack {{\psi\rangle}_{Gibbs}^{V}{\langle\psi }_{Gibbs}^{V}} \right\rbrack}} - {h_{k}\sigma_{Gibbs}}} \right\rbrack} \leq {\sum\limits_{i}{{\sigma_{i}\left( h_{k} \right)}{{{{Tr}_{B}\left\lbrack {{\psi\rangle}_{Gibbs}^{V}{\langle\psi }_{Gibbs}^{V}} \right\rbrack} - \sigma_{Gibbs}}}}} \leq {\overset{\sim}{\epsilon}{\sum\limits_{i}{{\sigma_{i}\left( h_{k} \right)}.}}}}} & (57) \end{matrix}$

For the final error being less then ϵ, the precision used in the phase estimation procedure, it is necessary to set {acute over (ϵ)}=ϵ/(2Σ_(i)σ_(i)(h_(k)))≤2^(−n−1)ϵ, reminding that h_(k) is unitary, and similarly precision ϵ/2 for the amplitude estimation, which yields the query complexity of

$\begin{matrix} {{\left( {\sqrt{\frac{N_{h}}{z_{h}}}\frac{{{H(\theta)}}d}{\epsilon}\left( {n^{2} + {n{\log \left( \frac{{H(\theta)}}{\epsilon} \right)}} + {n\; {\log \left( \frac{1}{\epsilon} \right)}} + {{\log \left( \frac{{H(\theta)}}{\epsilon} \right)}{\log \left( \frac{1}{\epsilon} \right)}}} \right)} \right)},\mspace{20mu} {\in {{\overset{\sim}{O}\left( {\sqrt{\frac{N_{h}}{z_{h}}}\left( \frac{n^{2}{{H(\theta)}}_{1}d}{\epsilon} \right)} \right)}.}}} & (58) \end{matrix}$

where one denotes with A the hidden subsystem with dimensionality 2^(n) ^(h) ≤2^(N), on which the Gibbs state is prepared, and with B, the subsystem for the trash state.

Similarly, the evaluation of the second part in Eq. 45 requires the Gibbs state preparation for H, the Hadamard test, and phase estimation. Similar to the above it is necessary to take the error into account. Letting the purified version of the Gibbs state for H be given by |ψ

_(Gibbs), which is obtained using Theorem 1, and letting σ_(Gibbs) be the perfect state, then the error is given by

$\begin{matrix} {{{{{Tr}_{AB}\left\lbrack {\left( {v_{k} \otimes h_{k} \otimes I} \right){\psi\rangle}_{Gibbs}{\langle\psi }_{Gibbs}} \right\rbrack} - {{Tr}_{A}\left\lbrack {\left( {v_{k} \otimes h_{k}} \right)\sigma_{Gibbs}} \right\rbrack}} = {{{Tr}_{A}\left\lbrack {{\left( {v_{k} \otimes h_{k}} \right){{Tr}_{B}\left\lbrack {{\psi\rangle}_{Gibbs}^{V}{\langle\psi }_{gibbs}^{V}} \right\rbrack}} - {\left( {v_{k} \otimes h_{k}} \right)\sigma_{Gibbs}}} \right\rbrack} \leq {\sum\limits_{i}{{\sigma_{i}\left( {v_{k} \otimes h_{k}} \right)}{{{{Tr}_{B}\left\lbrack {{\psi\rangle}_{Gibbs}^{V}{\langle\psi }_{Gibbs}^{V}} \right\rbrack} - {h_{k}\sigma_{Gibbs}}}}}} \leq {\overset{\sim}{\epsilon}{\sum\limits_{i}{\sigma_{i}\left( {v_{k} \otimes h_{k}} \right)}}}}},} & (59) \end{matrix}$

where in this case A is the subsystem of the visible and hidden subspace and B the trash system. The upper bound on the error is set as above and introducing ξ:=max[N/z, N_(h)/z_(h)], one can find that a uniform bound on the query complexity for evaluating

$\begin{matrix} {{\overset{\sim}{O}\left( {\sqrt{\zeta}\left( \frac{n^{2}{{H(\theta)}}_{1}d}{\epsilon} \right)} \right)},} & (60) \end{matrix}$

thus one attains the claimed query complexity by repeating the above procedure for each of the D components of the estimated gradient vector S.

Note that it is also necessary to evaluate the terms Tr [ρv_(k)] to precision {circumflex over (ϵ)}≤ϵ, which though only incurs an additive cost of D/ϵ to the total query complexity, since this step is required to be performed once. Note that |

_(h)(h_(p))|≤1 because h_(p) is assumed to be unitary. To complete the proof it is necessary only to take the success probability of the amplitude estimation process into account. For completeness, the algorithm is stated in the Appendix and herein only Theorem 5 is referred to, from which it follows that the procedure succeeds with probability at least 8/π². In order to have a failure probability of the final algorithm of less than ⅓, it is necessary to repeat the procedure for all d dimensions of the gradient and to take the median. The number of repetitions may now be bounded in the following way.

Let n_(ƒ) be the number of instances of the gradient estimate such that the error is larger than ϵ, and let n_(s) be the number of instances with an error ≤ϵ for one dimension of the gradient, and let the result taken be the median of the estimates, where n=n_(s)+n_(ƒ) samples are collected. The algorithm gives a wrong answer for each dimension if

${n_{s} \leq \left\lfloor \frac{n}{2} \right\rfloor},$

since then the median is a sample such that the error is not bound by ϵ. Let p=8/π² be the success probability to draw a positive sample, as is the case of the amplitude estimation procedure. Since each instance of the phase estimation algorithm will independently return an estimate, the total failure probability is given by the union bound, i.e.,

$\begin{matrix} {{{Pr_{fail}} \leq {D \cdot {\Pr \left\lbrack {n_{s} \leq \left\lfloor \frac{n}{2} \right\rfloor} \right\rbrack}} \leq {D \cdot q^{{- \frac{n}{2p}}{({p - \frac{1}{2}})}^{2}}} \leq \frac{1}{3}},} & (61) \end{matrix}$

which follows from the Chernoff inequality for a binomial variable with p>½, which is given in the present case. Therefore, by taking

${{n \geq {\frac{2p}{\left( {p - {1/2}} \right)^{2}}{\log \left( {3D} \right)}}} = {{\frac{16}{\left( {8 - {\pi^{2}/2}} \right)^{2}}{\log \left( {3D} \right)}} = {O\left( {\log \left( {3D} \right)} \right)}}},$

a total failure probability of at most ⅓ is achieved. This is sufficient to demonstrate the validity of the algorithm if

Tr[ρ{tilde over (H)} _(h)]  (62)

is known exactly. This is difficult to do because the probability distribution α_(h) is not usually known apriori. As a result, it is assumed that the distribution will be learned empirically and to do so it is necessary to draw samples from the purified Gibbs states used as input. This sampling procedure will incur errors. To take such errors into account, assume that it is possible to obtain estimates T_(h) of Eq. 62 with precision δ_(t), i.e.,

|T _(h)−Tr[ρ{tilde over (H)} _(h)]≤δ_(t).  (63)

Under this assumption, it is now possible to bound the distance |αh−{tilde over (α)}_(h)| in the following way. Observe that

$\begin{matrix} {{\left| {\alpha_{h} - {\overset{\sim}{\alpha}}_{h}} \right| = {{{\frac{e^{{- T}{r{\lbrack{\rho {\overset{\sim}{H}}_{h}}\rbrack}}}}{\Sigma_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}} - \frac{T_{h}}{\Sigma_{h}T_{h}}}} \leq {{{\frac{e^{{- T}{r{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}{\Sigma_{h}e^{{- T}{r{\lbrack{\rho {\overset{\sim}{H}}_{h}}\rbrack}}}} - \frac{T_{h}}{\Sigma_{h}e^{{- T}{r{\lbrack{\rho {\overset{\sim}{H}}_{h}}\rbrack}}}}}} + {{\frac{T_{h}}{\Sigma_{h}e^{{- T}{r{\lbrack{\rho {\overset{\sim}{H}}_{h}}\rbrack}}}} - \frac{T_{h}}{\Sigma_{h}T_{h}}}}}}},} & (64) \end{matrix}$

and hence it is necessary to bound the following two quantities in order to bound the error. First, a bound is required on

e ^(−Tr[ρ{tilde over (H)}) ^(h) ^(]) −e ^(−T) ^(h) |.  (65)

For this, let ƒ(s):=T_(h)(1−s)+Tr [ρ{tilde over (H)}_(h)] s, such that Eq. 65 can be rewritten as

$\begin{matrix} \begin{matrix} {{{e^{- {f{(1)}}} - e^{- {f{(0)}}}}} = {{\int_{0}^{1}{\frac{d}{ds}e^{- {f{(s)}}}{ds}}}}} \\ {= {{\int_{0}^{1}{{\overset{.}{f}(s)}e^{- {f{(s)}}}{ds}}}}} \\ {= {{\int_{0}^{1}{\left( {{T{r\left\lbrack {\rho {\overset{\sim}{H}}_{h}} \right\rbrack}} - T_{h}} \right)e^{- {f{(s)}}}{ds}}}}} \\ {\leq {\delta e^{{- m}\; i\; n_{s}{f{(s)}}}}} \\ {\leq {\delta e^{{{- T}{r{\lbrack{\rho H_{h}}\rbrack}}} + \delta}}} \end{matrix} & (66) \end{matrix}$

and assuming δ≤log(2), this reduces to

|e ^(−ƒ(1)) −e ^(−ƒ(0))|≤2δe ^(−Tr[ρ{tilde over (H)}) ^(h) ^(]).  (67)

Second,

$\begin{matrix} {{{{\sum\limits_{h}e^{{- T}{r{\lbrack{\rho {\overset{\sim}{H}}_{h}}\rbrack}}}} - {\sum\limits_{h}T_{h}}}} \leq {2\delta {\sum\limits_{h}{e^{{- T}{r{\lbrack{\rho {\overset{\sim}{H}}_{h}}\rbrack}}}.}}}} & (68) \end{matrix}$

Using this, Eq. 64 can be bounded above by

$\begin{matrix} {{{\frac{2\delta \; e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}{\sum_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}} + {{T_{h}}{{\frac{1}{\sum_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}} - \frac{1}{\left( {1 - {2\delta}} \right){\sum_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}}}}}} \leq {\frac{2\delta \; e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{n}}\rbrack}}}}{\sum_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}} + \frac{4\delta {T_{h}}}{\sum_{h}e^{- {{Tr}{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}}},} & (69) \end{matrix}$

where δ≤¼ is applied. Note that

4ϵ|T _(h)|≤4δ(e ^(−Tr[ρ{tilde over (H)}) ^(h) ^(])+2δe ^(−Tr[ρ{tilde over (H)}) ^(h) ^(]))=e ^(−Tr[ρ{tilde over (H)}) ^(h) ^(])(4δ+8δ²)≤e ^(−Tr[ρ{tilde over (H)}) ^(h) ^(])(4δ+2δ)≤6δe ^(−Tr[ρ{tilde over (H)}) ^(H) ^(]),  (70)

which leads to a final error of

$\begin{matrix} {{{\alpha_{h} - {\overset{\sim}{\alpha}}_{h}}} \leq {8\delta {\frac{e^{{- T}{r{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}{\Sigma_{h}e^{{- T}{r{\lbrack{\rho \; {\overset{\sim}{H}}_{h}}\rbrack}}}}.}}} & (71) \end{matrix}$

With this it is now possible to bound the error in the expectation w.r.t. the faulty distribution for some function ƒ(h) to be

$\begin{matrix} {{{{_{h}\left( {f(h)} \right)} - {{\overset{\sim}{}}_{h}\left( {f(h)} \right)}}} \leq {8\delta {\sum\limits_{h}\frac{{f(h)}e^{{- T}{r{\lbrack{\rho {\overset{\sim}{H}}_{h}}\rbrack}}}}{\Sigma_{h}e^{{- T}{r{\lbrack{\rho {\overset{\sim}{H}}_{h}}\rbrack}}}}}} \leq {8\underset{h}{\delta max}{{f(h)}.}}} & (72) \end{matrix}$

This can now be used in order to estimate the error introduced in the first term of Eq. 45 through errors in the distribution {α_(h)} as

$\begin{matrix} {{{{{{_{h}\left\lbrack E_{hp} \right\rbrack}T{r\left\lbrack {\rho v_{p}} \right\rbrack}} - {\overset{\sim}{}\left\lbrack {E_{hp}T{r\left\lbrack {\rho v_{p}} \right\rbrack}} \right\rbrack}}} \leq {8\underset{h}{\delta max}{{E_{hp}T{r\left\lbrack {\rho v_{p}} \right\rbrack}}}} \leq {8\delta {\max\limits_{h,p}{E_{hp}}}}},} & (73) \end{matrix}$

where in the last step the unitarity of v_(k) and the Von-Neumann trace inequality was used. For an final error of ϵ, δ_(t)=ϵ/[16 max_(h,p)|E_(h,p)] is therefore chosen, to ensure that this sampling error incurrs at most half the error budget of ϵ. Thus it is ensured that δ≤¼ if ϵ≤4 max_(h,p)|E_(h,p)|.

It is possible to improve the query complexity of estimating the above expectation by values by using amplitude amplification, since one obtains the measurement via a Hadamard test. For this case only O(max_(h,p)|E_(h,p)|/ϵ) samples are required in order to achieve the desired accuracy from the sampling. Noting that it may not be possible to even access {tilde over (H)}_(h) without any error, it is deduced that the error of the individual terms of {tilde over (H)}_(h) for an ϵ-error in the final estimate must be bounded by δ_(t)v∥θ∥₁, where with abuse of notation, δ_(t) now denotes the error in the estimates of E_(h,k). Even taking that into account, the evaluation of this contribution is dominated by the second term, and hence can be neglected in the analysis.

Theorem 2 shows that the computational complexity of estimating the gradient grows the closer one approaches a pure state, since for a pure state the inverse temperature β→∞, and therefore the norm ∥H(θ)∥→∞, as the Hamiltonian is depending on the parameters, and hence the type of state described. In such cases one typically would rely on alternative techniques. However, this cannot be generically improved because otherwise it would be possible to find minimum energy configurations using a number of queries in o(√{square root over (N)}), which would violate lower bounds for Grover's search. Therefore more precise statements of the complexity will require further restrictions on the classes of problem Hamiltonians to avoid lower bounds imposed by Grover's search and similar algorithms.

Returning again to the drawings, FIGS. 6A and 6B illustrate an example method 60A to estimate the gradient of the quantum relative entropy of a restricted QBM having visible and hidden nodes. In a restricted QBM, the Hamiltonian terms acting on the hidden units mutually commute by definition herein. As evident from Eq. 45, the estimated gradient may be computed as a difference of two terms, the first term relating to the training distribution and the second term relating to the quantum state of the visible nodes. Accordingly, FIG. 6A illustrates aspects of method 60A related to computation of the first term, while FIG. 6B illustrates aspects of method 60A related to computation of the second term. Method 60A may be employed as a particular instance of step 60 in the training method of FIG. 5. Each step of this method is developed in detail in the description above; accordingly, the present description provides only summary detail to enable the reader to understand the process flow in one non-limiting example.

In method 60A, estimation of the gradient of the quantum relative entropy of a QBM includes computing a variational upper bound on the quantum relative entropy, according to the following algorithm. Beginning in FIG. 6A, at 70 of method 60A, the trace Tr [ρv_(k)] computed for all k∈D is passed to a Gibbs-state preparation method (vide supra). At 72, biases θ_(k) and operator h_(k) are also passed to the Gibbs-state preparation method. At 74 the Gibbs-state preparation method is executed, resulting in population of the plurality of qubits of the quantum computer with a purified Gibbs state for Hamiltonians H_(h). In this method, estimating the gradient includes using substantially commuting operators (i.e., having a commutator which is small or negligible in comparison to each operator) to assign an energy penalty to each qubit corresponding to a hidden node of the QBM.

At 76 of method 60A, a control loop is encountered wherein an ancilla qubit is prepared in the state |+

=|0

+|1

. At 78 a controlled h_(k) operation is performed, using the ancilla qubit prepared at 76 as a control. At 82 a Hadamard gate is applied to the ancilla qubit. Then, at 84, the amplitude of the ancilla qubit state is estimated on the 10) state. At this point in the method, the confidence analysis described above is applied in order to determine whether additional measurements are required to achieve precision ϵ. If so, execution returns to 76. Otherwise, the product <E_(h,p)>_(h) Tr [ρv_(k)] of the expectation value and the trace is evaluated and returned.

Turning now to FIG. 6B, at 90 of method 60A, the visible state v_(k) is passed to the Gibbs-state preparation method. At 92, biases θ_(k) and operator h_(k) are also passed to the Gibbs-state preparation method. At 94 the Gibbs-state preparation method is executed, thereby populating the plurality of qubits of the quantum computer with a purified Gibbs state for Hamiltonian H.

At 96 of method 60A, a control loop is encountered wherein an ancilla qubit is prepared in the state |+

=|0

+|1

. At 100, a controlled v_(k)⊗h_(k) operation is applied, using the ancilla qubit as a control, prior to application at 102 of a Hadamard gate on the ancilla qubit. Then, at 104, the amplitude of the ancilla qubit state is estimated on the |0

state. At this point in the method, the confidence analysis described above is applied in order to determine whether additional measurements are required to achieve precision ϵ. If so, execution returns to 96. Otherwise, the resulting expectation value is evaluated and returned.

Approach 2: Training with Higher Order Divided Differences and Function Approximations

This section describes a scheme to train a QBM using divided difference estimates for the relative entropy error and to generate differentiation formulas by differentiating and interpolating. First an interpolating polynomial is constructed from the data. Second, an approximation of the derivative at any point can be obtained by a direct differentiation of the interpolant. In the following it is assumed that it will be possible to simulate and evaluate Tr [ρ log σ_(v)]. As this is generally non-trivial, and the error is typically large, in the next section is proposed a different, more specialised approach that, however, still allows training of arbitrary models with the relative entropy objective.

In order to proof the error of the gradient estimation via interpolation, it is necessary to first establish error bounds on the interpolating polynomial which can be obtained via the remainder of the Lagrange interpolation polynomial. The gradient error for the objective can then be obtained by as a combination of this error with a bound on the n+1-st order derivative of the objective. The first step is to bound the error in the polynomial approximation.

Lemma 2.

Let ƒ(θ) be the n+1 times differentiable function for which we want to approximate the gradient and let p_(n)(θ) be the degree n Lagrange interpolation polynomial for points {θ₁, θ₂, . . . , θ_(k), . . . , θ_(n)}. The gradient evaluated at point θ_(k) is then given by the interpolation polynomial

$\begin{matrix} {{\frac{\partial{p\left( \theta_{k} \right)}}{\partial\theta} = {\sum\limits_{j = 0}^{n}{{f\left( \theta_{j} \right)}{\mathcal{L}_{n,j}^{\prime}\left( \theta_{k} \right)}}}},} & (74) \end{matrix}$

where

_(n,j)′ is the derivative of the Lagrange interpolation polynomials

$\begin{matrix} {{{\mathcal{L}_{\mu,j}(\theta)}:={\prod\limits_{\underset{k \neq j}{k = 0}}^{\mu}\frac{\theta - \theta_{k}}{\theta_{j} - \theta_{k}}}},} & \; \end{matrix}$

and the error is given by

$\begin{matrix} {{{{\frac{\partial{f\left( \theta_{k} \right)}}{\partial\theta} - \frac{\partial{p_{n}\left( \theta_{k} \right)}}{\partial\theta}}} \leq {\frac{1}{\left( {n + 1} \right)!}{{{f^{({n + 1})}\left( {\xi \left( \theta_{k} \right)} \right)}{\prod\limits_{\underset{j \neq k}{j = 0}}^{n}\left( {\theta_{j} - \theta_{k}} \right)}}}}},} & (75) \end{matrix}$

where ξ(θ_(k)) is a constant depending on the point θ_(k) at which the gradient is evaluated, and where ƒ^((i)) denotes the i-th derivative of ƒ. Note that θ is a point within the set of points at which evaluation is attempted.

Proof.

Recall that the error for the degree n Lagrange interpolation polynomial is given by

$\begin{matrix} {{{{f(\theta)} - {p_{n}(\theta)}} \leq {\frac{1}{\left( {n + 1} \right)!}{f^{({n + 1})}\left( \xi_{\theta} \right)}{w(\theta)}}},} & (76) \end{matrix}$

where

${w(\theta)}:={\prod\limits_{j = 1}^{n}{\left( {\theta - \theta_{j}} \right).}}$

It is necessary to estimate the gradient of this, and hence to evaluate

$\begin{matrix} {{\frac{\partial{f(\theta)}}{\partial\theta} - \frac{\partial{p_{n}(\theta)}}{\partial\theta}} \leq {\lim\limits_{\Delta->0}{\left( \frac{{\frac{1}{\left( {n + 1} \right)!}{f^{({n + 1})}\left( \xi_{\theta + \Delta} \right)}{w\left( {\theta + \Delta} \right)}} - {\frac{1}{\left( {n + 1} \right)!}{f^{({n + 1})}\left( \xi_{\theta} \right)}{w(\theta)}}}{\Delta} \right).}}} & (77) \end{matrix}$

Now, since it is not necessary to estimate the gradient at an arbitrary point θ, but sufficient to estimate the gradient at a chosen point, it is possible to set θ to be one of the points at which the function ƒ(θ), i.e., θ∈{θ_(i)}_(i=1) ^(n), is evaluated. Let this choice be given by θ_(k), arbitrarily chosen. Then the latter term vanishes since w(θ_(k))=0. Therefore,

$\begin{matrix} {{{\frac{\partial{f\left( \theta_{k} \right)}}{\partial\theta} - \frac{\partial{p_{n}\left( \theta_{k} \right)}}{\partial\theta}} \leq {\lim\limits_{\Delta\rightarrow 0}\left( \frac{\frac{1}{\left( {n + 1} \right)!}{f^{({n + 1})}\left( \xi_{\theta_{k} + \Delta} \right)}{w\left( {\theta_{k} + \Delta} \right)}}{\Delta} \right)}},} & (78) \end{matrix}$

and noting that w(θ_(k)) contains one term (θ_(k)+Δ−θ_(k))=Δ achieves the claimed result.

A number of approximation steps will be performed in order to obtain a form which can be simulated on a quantum computer more efficiently, and only then resolve to divided differences at this “lower level”. In detail the following steps are performed. Recalling the need to evaluate the gradient of

${Tr}{\left\lbrack {{\frac{\partial}{\partial\theta}\rho}{\log \left( \sigma_{v} \right)}} \right\rbrack,}$

-   -   1. The logarithm is first approximated via a Fourier-like         approximation, i.e.,

log σ_(v)→log_(K,M)σ_(v),  (79)

-   -   -   similar to [3], which will yield a Fourier-like series in             terms of σ_(v), i.e., Σ_(m)c_(m) exp (imπσ_(v)).

    -   2. Next, it is necessary to evaluate the gradient of the         function

${Tr}{\left\lbrack {{\frac{\partial}{\partial\theta}\rho}{\log_{K,M}\left( \sigma_{v} \right)}} \right\rbrack,}$

where log(σ_(v)) is now approximated with the truncated Fourier series up to order K, using an additional parameter M which will be explained below. Taking the derivative yields many terms of the form

$\begin{matrix} {{\int_{0}^{1}{dse^{({ism\pi \sigma_{v}})}\frac{\partial\sigma_{v}}{\partial\theta}e^{({{i{({1 - s})}}m\pi \sigma_{v}})}}},} & (80) \end{matrix}$

-   -   -   as a result of the Duhamel's formula. Note that this             derivative is exact except the approximation error of the             logarithm. Each term in this expansion can furthermore be             evaluated separately via a sampling procedure, i.e.,

$\begin{matrix} {{{\int_{0}^{1}{dse^{({ism\pi \sigma_{v}})}\frac{\partial\sigma_{v}}{\partial\theta}e^{({{i{({1 - s})}}m\pi \sigma_{v}})}}} \approx {_{s}\left\lbrack {e^{({ism\pi \sigma_{v}})}\frac{\partial\sigma_{v}}{\partial\theta}e^{({{i{({1 - s})}}m\pi \sigma_{v}})}} \right\rbrack}},} & (81) \end{matrix}$

-   -   -   and there is now only a logarithmic number of terms, so the             result can be combined via classical postprocessing once the             trace is evaluated.

    -   3. Next, a divided difference scheme is applied to approximate         the gradient

$\begin{matrix} {\frac{\partial\sigma_{v}}{\partial\theta},} & \; \end{matrix}$

which results in a polynomial of order n (i.e., the number of points at which the function is evaluated) in σ_(v), which we can be evaluated efficiently.

-   -   4. However, evaluating these terms is still not trivial. The         final step consists hence of implementing a routine that allows         evaluation of these terms on a quantum device. In order to do         so, one again makes use of the Fourier series approach. Applied         this time is the simple idea of approximating the density         operator σ_(v) by the series of itself, i.e.,         σ_(v)≈F(σ_(v)):=Σ_(m′)c_(m′)exp (imπm′σ_(v)), which can be         implemented conveniently via sample based Hamiltonian simulation         [ ].

The following provides concrete bounds on the error introduced by the approximations and details of the implementation. The final result is then stated in Theorem 4. First one bounds the error in the approximation of the logarithm and then uses Lemma 37 of [3] to obtain a Fourier series approximation which is close to log(z). The Taylor series of

for x∈(0,1) and where

$\begin{matrix} {\begin{matrix} {{\log (x)} = {\sum\limits_{k = 1}^{\infty}{\left( {- 1} \right)^{k + 1}\frac{\left( {x - 1} \right)^{k}}{k}}}} \\ {{= {{\sum\limits_{k = 1}^{K_{1}}{\left( {- 1} \right)^{k + 1}\frac{\left( {x - 1} \right)^{k}}{k}}} + {R_{K_{1} + 1}\left( {x - 1} \right)}}},} \end{matrix}\quad} & (82) \end{matrix}$

for x 6 (0.1) and where

${R_{K + 1}(z)} = {\frac{f^{K_{1} + 1}(c)}{K!}\left( {z - c} \right)^{K_{1}}z}$

is the Cauchy remainder of the Taylor series, for −1<z<0. The error can hence be bounded as

$\begin{matrix} {{{{R_{K_{1} + 1}(z)}} = {{\left( {- 1} \right)^{K_{1}}\frac{{z^{K_{1} + 1}\left( {1 - \alpha} \right)}^{K_{1}}}{\left( {1 + {\alpha z}} \right)^{K_{1} + 1}}}}},} & (83) \end{matrix}$

where the derivatives of the logarithm are evaluated, and where 0≤α≤1 is a parameter. Using that 1+αz≥1+z (since z≤0) and hence

${0 \leq \frac{1 - \alpha}{1 + {\alpha z}} \leq 1},$

the error bound becomes

$\begin{matrix} {{{R_{K_{1} + 1}(z)}} \leq \frac{{z}^{K_{1} + 1}}{1 + z}} & (84) \end{matrix}$

Reversing to the variable x the error bound for the Taylor series, and assuming that 0<δ_(l)<z and 0<|−z|≤δ_(u)<1, which is justified when dealing with sufficiently mixed states, then the approximation error is given by

$\begin{matrix} {{{R_{K_{1} + 1}(z)}} \leq \frac{\left( \delta_{l} \right)^{K_{1} + 1}}{\delta_{u}}\overset{!}{\leq}{\epsilon_{1}.}} & (85) \end{matrix}$

Hence in order to achieve the desired error ϵ₁ we need

$\begin{matrix} {K_{1} \geq {\frac{\log \left( \left( {\epsilon_{1}\delta_{u}} \right)^{- 1} \right)}{\log \left( \left( \delta_{l} \right)^{- 1} \right)}.}} & (86) \end{matrix}$

Hence, it is possible to choose K₁ such that the error in the approximation of the Taylor series is ≤ϵ₁/4. This implies the ability to make use of Lemma 37 of [3], and therefore to obtain a Fourier series approximation for the logarithm. This Lemma is now restated here for completeness:

Lemma 3.

(Lemma 37 in [3]) Let ƒ:

→

and δ,ϵ∈(0,1), and T(ƒ):=Σ_(k=0) ^(K)a_(k)x^(k) be a polynomial such that |ƒ(x)−T(ƒ)|≤ϵ/4 for all x∈[−1+δ,1−δ]. Then ∃c∈

^(2M+1).

$\begin{matrix} {{{{f(x)} - {\sum\limits_{m = {- M}}^{M}{c_{m}e^{\frac{i\; \pi \; m}{2}x}}}}} \leq \epsilon} & (87) \end{matrix}$

for all x∈[−1+δ,1−], where

$M = {\max \left( {{2\left\lceil {{\ln \left( \frac{4{a}_{1}}{\epsilon} \right)}\frac{1}{\delta}} \right\rbrack},0} \right)}$

and ∥c∥₁≤∥a∥₁. Moreover, c can be efficiently calculated on a classical computer in time poly(K,M,log(1/ϵ)).

In order to apply this lemma to the present case, the approximation rate is now restricted to the range (δ_(l),δ_(u)), where 0<δ_(l)≤δ_(u)<1. Therefore over this range an approximation of the following form is obtained.

Corollary 1.

Let ƒ:

→

be defined as ƒ(x)=log(x), δ,ϵ₁∈(0,1), and

${\log_{K}\left( {1 - x} \right)}:={\sum_{k = 1}^{K_{1}}{\frac{\left( {- 1} \right)^{k - 1}}{k}x^{k}}}$

such that

${a_{k}:} = \frac{\left( {- 1} \right)^{k - 1}}{k}$

and

${a}_{1} = {\sum_{k = 1}^{K_{1}}\frac{1}{k}}$

with

$K_{1} \geq \frac{\log \left( {4\left( {\epsilon_{1}\delta_{1}^{u}} \right)^{- 1}} \right)}{\log \left( \left( \delta_{l} \right)^{- 1} \right)}$

such that |log(x)−log_(K)(x)|≤ϵ₁/4 for all x∈[δ_(l),δ_(u)]. Then ∃c∈

^(2M+1).

$\begin{matrix} {{{{f(x)} - {\sum\limits_{m = {- M_{1}}}^{M_{1}}{c_{m}e^{\frac{i\; \pi \; m}{2}x}}}}} \leq \epsilon_{1}} & (88) \end{matrix}$

for all x∈[δ_(l),δ_(u)], where

$M_{1} = {\max \left( {{2\left\lceil {{\ln \left( \frac{4{a}_{1}}{\epsilon_{1}} \right)}\frac{1}{1 - \delta_{u}}} \right\rceil},0} \right)}$

and ∥c∥₁≤∥a∥₁. Moreover, c can be efficiently calculated on a classical computer in time poly(K₁, M₁, log(1/ϵ₁)).

Proof.

The proof follows straightforwardly by combining Lemma 3 with the approximation of the logarithm and the range over which we want to approximate the function.

In the following

${{\log_{K,M}(x)}:={\sum\limits_{m = {- M_{1}}}^{M_{1}}{c_{m}e^{\frac{i\pi m}{2}x}}}},$

where the K-subscript is retained to denote that classical computation of this approximation is poly(K)-dependent. Now the gradient of the objective is expressed via this approximation as

$\begin{matrix} {{Tr}{\left\lbrack {{\frac{\partial}{\partial\theta}\rho}\log_{K,M}\sigma_{v}} \right\rbrack \approx {\sum\limits_{m = {- M_{1}}}^{M_{1}}{\frac{ic_{m}m\pi}{2}{\int_{0}^{1}{{{dsTr}\left\lbrack {\rho \; e^{\frac{is\pi m}{2}\sigma_{v}}\frac{\partial\sigma_{v}}{\partial\theta}e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{v}}} \right\rbrack}.}}}}}} & (89) \end{matrix}$

where each term in the sum may be evaluated individually and the results classically post-processed, i.e., sumed up. In particular the latter can be evaluated as the expectation value over s, i.e.,

$\begin{matrix} {{{\int_{0}^{1}{{dsTr}\left\lbrack {\rho \; e^{\frac{is\pi m}{2}\sigma_{v}}\frac{\partial\sigma_{v}}{\partial\theta}e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{v}}} \right\rbrack}} = {_{s \in {\lbrack{0,1}\rbrack}}\left\lbrack {{Tr}\left\lbrack {\rho \; e^{\frac{is\pi m}{2}\sigma_{v}}\frac{\partial\sigma_{v}}{\partial\theta}e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{v}}} \right\rbrack} \right\rbrack}},} & (90) \end{matrix}$

which can be evaluated separately on a quantum device. In the following it is necessary to devise a method to evaluate this expectation value.

First, the gradient is expanded using a divided difference formula such that

$\frac{\partial\sigma_{v}}{\partial\theta}$

is approximated by the Lagrange interpolation polynomial of degree μ−1, i.e.,

$\begin{matrix} {{{\frac{\partial\sigma_{v}}{\partial\theta}(\theta)} \approx {\sum\limits_{j = 0}^{\mu}{{\sigma_{v}\left( \theta_{j} \right)}{\mathcal{L}_{\mu,j}^{\prime}(\theta)}}}},{where}} & (91) \\ {{\mathcal{L}_{\mu,j}(\theta)}:={\prod\limits_{\underset{k \neq j}{k = 0}}^{\mu}{\frac{\theta - \theta_{k}}{\theta_{j} - \theta_{k}}.}}} & (92) \end{matrix}$

Note that the order p is free to chose, and will guarantee a different error in the solution of the gradient estimate as described prior in Lemma 2. Using this in the gradient estimation, a polynomial of the form is obtained (evaluated at θ_(j), i.e., the chosen points)

$\begin{matrix} {{\sum\limits_{m = {- M_{1}}}^{M_{1}}{\frac{ic_{m}m\pi}{2}{\sum\limits_{j = 0}^{\mu}{{\mathcal{L}_{\mu,j}^{\prime}\left( \theta_{j} \right)}{_{s \in {\lbrack{0,1}\rbrack}}\left\lbrack {{Tr}\left\lbrack {\rho \; e^{\frac{is\pi m}{2}\sigma_{v}}{\sigma_{v}\left( \theta_{j} \right)}e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{v}}} \right\rbrack} \right\rbrack}}}}},} & (93) \end{matrix}$

where each term again can be evaluated separately, and efficiently combined via classical post-processing. Note that the error in the Lagrange interpolation polynomial decreases exponentially fast, and therefore the number of terms we use is sufficiently small to do so. Next, it is necessary to deploy a method to evaluate the above expressions. In order to do so, σ_(v) is implemented as a Fourier series of itself, i.e., σ_(v)=arcsin(sin(σ_(v)π/2)/(π/2)), which will then be approximated in a similar approach as taken in Lemma 3. With this the following result is obtained.

Lemma 4.

Let δ, ϵ₂∈(0,1), and {tilde over (x)}:=Σ_(m′=−M) ₂ ^(M) ² {tilde over (c)}_(m′)e^(iπm′x/2) with

$K_{2} \geq {\frac{\log \left( {4/\epsilon_{2}} \right)}{\log \left( \delta_{u}^{- 1} \right)}\mspace{14mu} {and}}$ $M_{2} \geq \left\lceil {{\log \left( \frac{4}{\epsilon_{2}} \right)}\sqrt{\left( {2\log \delta_{u}^{- 1}} \right)^{- 1}}} \right\rceil$

and x∈[δ_(l),δ_(u)]. Then ∃{tilde over (c)}∈

^(2M+1):

|x−{tilde over (x)}|≤ϵ ₂  (94)

for all x∈[δ_(l),δ_(u)], and ∥c∥₁≤1. Moreover, {tilde over (c)} can be efficiently calculated on a classical computer in time poly(K₂, M₂, log(1/ϵ₂)).

Proof.

Invoking the technique used in [3], we expand

$\begin{matrix} {{{\arcsin (z)} = {{\sum\limits_{k = 0}^{K_{2}}{2^{{- 2}k^{\prime}}\begin{pmatrix} {2k^{\prime}} \\ k \end{pmatrix}\frac{z^{{2k^{\prime}} + 1}}{{2k^{\prime}} + 1}}} + {R_{K_{2} + 1}(z)}}},} & (95) \end{matrix}$

where R_(K) ₂ ₊₁ is the remainder as before. For 0<z≤δ_(u)≤½, remainder can be bound by

${{R_{K_{2} + 1}} \leq \frac{{\delta_{u}}^{K_{2} + 1}}{1/2}\overset{!}{\leq}{ɛ_{2}/2}},$

which gives the bound

$\begin{matrix} {{K_{2} \geq \frac{\log \left( {4/\epsilon_{2}} \right)}{\log \left( \delta_{u}^{- 1} \right)}}.} & (96) \end{matrix}$

By approximation,

$\begin{matrix} {{\sin^{l}(x)} = {\left( \frac{i}{2} \right)^{l}{\sum\limits_{m^{\prime} = 0}^{l}{\left( {- 1} \right)^{m^{\prime}}\begin{pmatrix} l \\ m^{\prime} \end{pmatrix}e^{i{x{({{2m^{\prime}} - l})}}}}}}} & (97) \end{matrix}$

by

$\begin{matrix} {{{\sin^{l}(x)} \approx {\left( \frac{i}{2} \right)^{l}{\sum\limits_{m^{\prime} = {{\lceil{l/2}\rceil} - M_{2}}}^{{\lfloor{l/2}\rfloor} + M_{2}}{\left( {- 1} \right)^{m^{\prime}}\begin{pmatrix} l \\ m^{\prime} \end{pmatrix}e^{i{x{({{2m^{\prime}} - l})}}}}}}},} & (98) \end{matrix}$

which induces an error of ϵ₂/2 for the choice

$\begin{matrix} {{M_{2} \geq \left\lceil {{\log \left( \frac{4}{\epsilon_{2}} \right)}\sqrt{\left( {2\log \delta_{u}^{- 1}} \right)^{- 1}}} \right\rceil}.} & (99) \end{matrix}$

This can be seen by using Chernoff's inequality for sums of binomial coefficients, i.e.,

$\begin{matrix} {{{\sum\limits_{m^{\prime} = {\lceil{{l/2} + M_{2}}\rceil}}^{l}{2^{- l}\begin{pmatrix} l \\ m^{\prime} \end{pmatrix}}} \leq e^{- \frac{2M_{2}^{2}}{l}}},} & (100) \end{matrix}$

and choosing M appropriately. Finally, defining ƒ(z):=arcsin(sin(zπ/2)/(π/2)), as well as {tilde over (ƒ)}₁:=Σ_(k′=0) ^(K) ² b_(k′) sin^(2k′+1)(zπ/2) and

$\begin{matrix} {{{{{\overset{\sim}{f}}_{2}(z)}:} = {\sum\limits_{k^{\prime} = 0}^{K_{2}}{{b_{k^{\prime}}\left( \frac{i}{2} \right)}^{l}{\sum\limits_{m^{\prime} = {{\lceil{l/2}\rceil} - M_{2}}}^{{\lfloor{l/2}\rfloor} + M_{2}}{\left( {- 1} \right)^{m^{\prime}}\begin{pmatrix} l \\ m^{\prime} \end{pmatrix}e^{i{x{({{2m^{\prime}} - l})}}}}}}}},} & (101) \end{matrix}$

and observing that

∥ƒ−{tilde over (ƒ)}₂∥_(∞)≤∥ƒ−{tilde over (ƒ)}₁∥_(∞)+∥{tilde over (ƒ)}₁−{tilde over (ƒ)}₂∥_(∞),  (102)

yields the final error of ϵ₂ for the approximation z≈{tilde over (z)}=Σ_(m′){tilde over (c)}_(m′)e^(iπm′z/2).

Note that this immediately leads to an ϵ₂ error in the spectral norm for the approximation

$\begin{matrix} {{{{\sigma_{\upsilon} - {\sum\limits_{m^{\prime} = {- M_{2}}}^{M_{2}}{{\overset{\sim}{c}}_{m^{\prime}}e^{i\; \pi \; m^{\prime}{\sigma_{\upsilon}/2}}}}}}_{2} \leq \epsilon_{2}},} & (103) \end{matrix}$

where σ_(v) is the reduced density matrix.

Since the final goal is to estimate Tr [∂_(θ)ρ log σ_(v)], with a variety of σ_(v)(θ_(j)) using the divided difference approach, it is also necessary to bound the error in this estimate which is now introduced with the above approximations. Bounding the derivative with respect to the remainder can be done by using the truncated series expansion and bounding the gradient of the remainder. This yields the following result.

Lemma 5.

For the of the parameters M₁, M₂, K₁, L, μ, Δ, s given in Eq. (143-150), and ρ, σ_(v) being two density matrices, we can estimate the gradient of the relative entropy such that

|∂_(θ)Tr[ρ log σ_(v)]−∂_(θ)Tr[ρ log_(K) ₁ _(,M) ₁ {tilde over (σ)}_(v)]|≤ϵ,  (104)

where the function ∂_(θ)Tr [ρ log_(K) ₁ _(,M) ₁ {tilde over (σ)}_(v)] evaluated at θ is defined as

$\begin{matrix} {{Re}\left\lbrack {\sum\limits_{m = {- M_{1}}}^{M_{1}}{\sum\limits_{m^{\prime} = {- M_{2}}}^{M_{2}}{\frac{ic_{m}{\overset{\sim}{c}}_{m^{\prime}}m\; \pi}{2}{\sum\limits_{j = 0}^{\mu}{{\mathcal{L}_{\mu,j}^{\prime}(\theta)}{_{s \in {\lbrack{0,1}\rbrack}}\left\lbrack {{Tr}\left\lbrack {\rho \; e^{\frac{i\; s\; \pi \; m}{2}\sigma_{\upsilon}}e^{\frac{i\; \pi \; m^{\prime}}{2}{\sigma_{\upsilon}{(\theta_{j})}}}e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{\upsilon}}} \right\rbrack} \right\rbrack}}}}}} \right\rbrack} & (105) \end{matrix}$

The gradient can hence be approximated to error ϵ with O(poly(M₁, M₂, K₁, L, s, Δ, μ)) computation on a classical computer and using only the Hadamard test, Gibbs state preparation and LCU on a quantum device.

Notably the expression in Eq. 105 can now be evaluated with a quantum-classical hybrid device by evaluating each term in the trace separately via a Hadamard test and, since the number of terms is only polynomial, and then evaluating the whole sum efficiently on a classical device.

Proof.

For the proof we perform the following steps. Let σ_(i)(ρ) be the singular values of ρ, which are equivalently the eigenvalues since ρ is Hermitian. Then observe that the gradient can be separated in different terms, i.e., let log_(K) ₁ _(,M) ₁ ^(s)σ_(v) be the approximation as given above for a finite sample of the expectation values

_(s), then

$\begin{matrix} {{{{\partial_{\theta}{{Tr}\left\lbrack {{\rho log}\sigma_{\upsilon}} \right\rbrack}} - {\partial_{\theta}{{Tr}\left\lbrack {\rho \log_{K_{1},M_{1}}^{s}{\overset{\sim}{\sigma}}_{\upsilon}} \right\rbrack}}}} \leq \leq {\sum\limits_{i}{{\sigma_{i}(\rho)} \cdot {{\partial_{\theta}\left\lbrack {{\log \sigma_{\upsilon}} - {\log_{K_{1},M_{1}}^{s}{\overset{\sim}{\sigma}}_{\upsilon}}} \right\rbrack}}}} \leq {\sum\limits_{i}{{\sigma_{i}(\rho)} \cdot \left( {{{\partial_{\theta}\left\lbrack {{\log \sigma_{\upsilon}} - {\log_{K_{1},M_{1}}\sigma_{\upsilon}}} \right\rbrack}} + {{\partial_{\theta}\left\lbrack {{\log_{K_{1},M_{1}}\sigma_{\upsilon}} - {\log_{K_{1},M_{1}}{\overset{\sim}{\sigma}}_{\upsilon}}} \right\rbrack}} + {{\partial_{\theta}\left\lbrack {{\log_{K_{1},M_{1}}{\overset{\sim}{\sigma}}_{\upsilon}} - {\log_{K_{1},M_{1}}^{s}{\overset{\sim}{\sigma}}_{\upsilon}}} \right\rbrack}}} \right)}}} & (106) \end{matrix}$

where the second step follows from the Von-Neumann trace inequality and the terms are (1) the error in approximating the logarithm, (2) the error introduced by the divided difference and the approximation of σ_(v) as a Fourier-like series, and (3) is the finite sampling approximation error. It is now possible to bound the different terms separately, and to start with the first part which is in general harder to estimate. The bound is partitioned into three terms, corresponding to the three different approximations taken above.

$\begin{matrix} {\mspace{79mu} {{{\partial_{\theta}\left\lbrack {{\log \sigma_{\upsilon}} - {\log_{K_{1},M_{1}}\sigma_{\upsilon}}} \right\rbrack}} \leq}} & (107) \\ {\mspace{70mu} {\leq {{{\partial_{\theta}{\sum\limits_{k = {K_{1} + 1}}^{\infty}\; {\frac{\left( {- 1} \right)^{k}}{k}\sigma_{\upsilon}^{k}}}}} + {{\partial_{\theta}{\sum\limits_{k = 1}^{K_{1}}\; {\frac{\left( {- 1} \right)^{k}}{k}{\sum\limits_{l = L}^{\infty}{b_{l}^{(k)}{\sin^{l}\left( {\sigma_{\upsilon}\pi \text{/}2} \right)}}}}}}}}}} & (108) \\ {+ {{\partial_{\theta}{\sum\limits_{k = 1}^{K_{1}}\; {\frac{\left( {- 1} \right)^{k}}{k}{\sum\limits_{l = L}^{\infty}{{b_{l}^{(k)}\left( \frac{1}{2} \right)}^{l}{\sum\limits_{m \in {{\lbrack{0,{{\lceil{l/2}\rceil} - M_{1}}}\rbrack}\bigcup{\lbrack{{{\lfloor{l/2}\rfloor} + M_{1}},l}\rbrack}}}^{\;}\; {\left( {- 1} \right)^{m}e^{{i{({{2\; m} - l})}}\sigma_{\upsilon}{\pi/2}}}}}}}}}}} & (109) \end{matrix}$

The first term can be bound in the following way:

$\begin{matrix} {{{\leq {\sum\limits_{k = {K_{1} + 1}}^{\infty}{\sigma_{\upsilon}}^{k - 1}}} = \frac{{\sigma_{\upsilon}}^{K_{1}}}{1 - {\sigma_{\upsilon}}}},} & (110) \end{matrix}$

and, assuming ∥σ_(v)∥<1, we hence can set

K ₁≥log((1−∥σ_(v)∥)ϵ/9)/log(∥σ_(v)∥)  (111)

appropriately in order to achieve an ϵ/9 error. The second term can be bound by assuming that ∥σ_(v)π∥<1, and choosing

$\begin{matrix} {{L \geq {\log \frac{\left( \frac{\epsilon}{9\pi \; K{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}} \right)}{\log \left( {{\sigma_{\upsilon}}\pi} \right)}}},} & (112) \end{matrix}$

which we derive by observing that

$\begin{matrix} {\leq {\sum\limits_{k = 1}^{K}{\frac{1}{k}{\sum\limits_{l = L}^{\infty}{b_{l}^{(k)}l{{{\sin^{l - 1}\left( {\sigma_{\upsilon}{\pi/2}} \right)}} \cdot {{\frac{\pi}{2}\frac{\partial\sigma_{\upsilon}}{\partial\theta}}}}}}}}} & (113) \\ {< {\sum\limits_{k = 1}^{K}{\frac{1}{k}{\sum\limits_{l = {L + 1}}^{\infty}{b_{l}^{(k)}\pi {{{\sigma_{\upsilon}\pi}}^{l - 1} \cdot {\frac{\partial\sigma_{\upsilon}}{\partial\theta}}}}}}}} & (114) \\ {{\leq {\sum\limits_{k = 1}^{K}{\frac{1}{k}\pi {{{\sigma_{\upsilon}\pi}}^{L} \cdot {\frac{\partial\sigma_{\upsilon}}{\partial\theta}}}}}},} & (115) \end{matrix}$

where l<2^(l) is used in the second step. Finally, the last term can be bound similarly, which yields

$\begin{matrix} {\leq {\sum\limits_{k = 1}^{K}{\frac{1}{k}{\sum\limits_{l = 1}^{L}{b_{l}^{(k)}{e^{{- 2}{{(M_{1})}^{2}/l}} \cdot l \cdot \frac{\pi}{2}}{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}}}}}} & (116) \\ {\leq {\sum\limits_{k = 1}^{K}{\frac{L}{k}{\sum\limits_{l = 1}^{L}{b_{l}^{(k)}e^{{- 2}{(M_{1})}^{2}\text{/}L}\frac{\pi}{2}{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}}}}}} & (117) \\ {{\leq {\sum\limits_{k = 1}^{K}{\frac{L}{k}e^{{- 2}{(M_{1})}^{2}\text{/}L}\frac{\pi}{2}{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}}} \leq {\frac{KL\pi}{2}e^{{- 2}{(M_{1})}^{2}\text{/}L}{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}}},} & (118) \end{matrix}$

and we can hence chose

$\begin{matrix} {M_{1} \geq \sqrt{L\; {\log\left( \frac{9{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}K_{1}L\pi}{2\epsilon} \right)}}} & (119) \end{matrix}$

in order to decrease the error to e/3 for the first term in Eq. 106. For the second term, first note that with the notation chosen, ∥∂_(θ)[log_(K) ₁ _(,M) ₁ σ_(v)−log_(K) ₁ _(,M) ₁ {tilde over (σ)}_(v)]∥ is the difference between the log-approximation where the gradient of σ_(v) is still exact, i.e., Eq. 89, and the version where the gradient is approximated via divided differences and the linear combination of unitaries, given in Eq. 105. Recall that the first level approximation was given by

$\begin{matrix} {{\sum\limits_{m = {- M_{1}}}^{M_{1}}{\frac{ic_{m}m\pi}{2}{\int_{0}^{1}{{ds}\; {{Tr}\left\lbrack {\rho e^{\frac{is\pi m}{2}\sigma_{\upsilon}}\frac{\partial\sigma_{\upsilon}}{\partial\theta}e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{\upsilon}}} \right\rbrack}}}}},} & (120) \end{matrix}$

where one returns from the expectation value formulation back to the integral formulation to avoid consideration of potential errors due to sampling.

Bounding the difference hence yields one term from the divided difference approximation of the gradient and an error from the Fourier series, which are both bounded separately. Denoting ∂{tilde over (p)}(θ_(k))/∂θ as the divided difference and the LCU approximation of the Fourier series¹, and with ∂p(θ_(k))/∂θ the divided difference without approximation via the Fourier series, ¹ which effectively means that the coefficients of the interpolation polynomial are approximated

$\begin{matrix} {\mspace{79mu} {{{\partial_{\theta}\left\lbrack {{\log_{K_{1},M_{1}}\sigma_{\upsilon}} - {\log_{K_{1},M_{1}}{\overset{\sim}{\sigma}}_{\upsilon}}} \right\rbrack}} \leq}} & (121) \\ {\leq {{\sum\limits_{m = {- M_{1}}}^{M_{1}}{\frac{ic_{m}m\pi}{2}{\int_{0}^{1}{{ds}\; {{Tr}\left\lbrack {\rho {e^{\frac{is\pi m}{2}\sigma_{\upsilon}}\left( {\frac{\partial\sigma_{\upsilon}}{\partial\theta} - \frac{\partial{\overset{\sim}{p}\left( \theta_{k} \right)}}{\partial\theta}} \right)}e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{\upsilon}}} \right\rbrack}}}}}}} & (122) \\ {\mspace{79mu} {\leq {\frac{M_{1}\pi {a}_{1}}{2}{\int_{0}^{1}{ds{\sum\limits_{i}{{\sigma_{i}(\rho)}{{\frac{\partial\sigma_{\upsilon}}{\partial\theta} - \frac{\partial{\overset{\sim}{p}\left( \theta_{k} \right)}}{\partial\theta}}}}}}}}}} & (123) \\ {\leq {\frac{M_{1}\pi {a}_{1}}{2}{\int_{0}^{1}{ds{\sum\limits_{i}{{\sigma_{i}(\rho)}\left( {{{\frac{\partial\sigma_{\upsilon}}{\partial\theta} - \frac{\partial{p\left( \theta_{k} \right)}}{\partial\theta}}} + {{\frac{\partial{p\left( \theta_{k} \right)}}{\partial\theta} - \frac{\partial{\overset{\sim}{p}\left( \theta_{k} \right)}}{\partial\theta}}}} \right)}}}}}} & (124) \\ {\leq {\frac{M_{1}{a}_{1}\pi}{2}{\int_{0}^{1}{ds{\sum\limits_{i}{{\sigma_{i}(\rho)}\left( {{{\frac{\partial^{\mu + 1}\sigma_{\upsilon}}{\partial\theta^{\mu + 1}}}\left( \frac{\Delta}{\mu - 1} \right)^{\mu}\frac{\max_{k}{\left( {\mu - k} \right)!}}{\left( {\mu + 1} \right)!}} + {\sum\limits_{j = 0}^{\mu}{{{\mathcal{L}_{\mu,j}^{\prime}\left( \theta_{j} \right)}}{{\sigma_{\upsilon} - {\overset{\sim}{\sigma}}_{\upsilon}}}}}} \right)}}}}}} & (125) \\ {{\leq {\frac{M_{1}{a}_{1}\pi}{2}\left( {{{\frac{\partial^{\mu + 1}\sigma_{\upsilon}}{\partial\theta^{\mu + 1}}}\left( \frac{\Delta}{\mu - 1} \right)^{\mu}\frac{\mu!}{\left( {\mu + 1} \right)!}} + {\mu {{\mathcal{L}_{\mu,j}^{\prime}\left( \theta_{j} \right)}}_{\infty}\epsilon_{2}}} \right)}},} & (126) \end{matrix}$

where ∥a∥₁=Σ_(k=1) ^(K) ¹ /k, and in the last step the results of Lemma 4 are used. Under appropriate assumptions on the grid-spacing for the divided difference scheme Δ and the number of evaluated points p as well as a bound on the μ+1-st derivative of σ_(v) w.r.t. θ, it is now possible to also bound this error. In order to do so, it is necessary to analyze the μ+1-st derivative of σ_(v)=Tr_(h) [e^(−H)]/Z with Z=Tr [e^(−H)]. For this,

$\begin{matrix} {{\frac{\partial^{\mu + 1}\sigma_{\upsilon}}{\partial\theta^{\mu + 1}}} \leq {\sum\limits_{p = 1}^{\mu + 1}{\begin{pmatrix} {\mu + 1} \\ p \end{pmatrix}{\frac{{\partial^{p}T}{r_{h}\left\lbrack e^{- H} \right\rbrack}}{\partial\theta^{p}}}{\frac{\partial^{\mu + 1 - p}Z^{- 1}}{\partial\theta^{\mu + 1 - p}}}}} \leq {2^{\mu + 1}{\max\limits_{p}{{\frac{{\partial^{p}T}{r_{h}\left\lbrack e^{- H} \right\rbrack}}{\partial\theta^{p}}}{\frac{\partial^{\mu + 1 - p}Z^{- 1}}{\partial\theta^{\mu + 1 - p}}}}}}} & (127) \end{matrix}$

Also,

$\begin{matrix} {{\frac{{\partial^{p}T}{r_{h}\left\lbrack e^{- H} \right\rbrack}}{\partial\theta^{p}}} \leq {{\dim \left( H_{h} \right)}{\frac{\partial^{q}e^{- H}}{\partial\theta^{q}}}}} & (128) \end{matrix}$

where dim(H_(h))=2^(n) ^(h) . In order to bound this, the infinitesimal expansion of the exponent is employed, i.e.,

$\begin{matrix} \begin{matrix} {{\frac{\partial^{q}e^{- H}}{\partial\theta^{q}}} = {{\frac{\partial^{q}}{\partial\theta^{q}}{\lim\limits_{r\rightarrow\infty}{\prod\limits_{j = 1}^{r}e^{{- H}/r}}}}}} \\ {= {{\lim\limits_{r\rightarrow\infty}\ \left( {{\frac{\partial^{q}e^{{- H}/r}}{\partial\theta^{q}}{\prod\limits_{j = 2}^{r}e^{{- H}/r}}} +} \right.}}} \\ {\left. {{\frac{\partial^{q - 1}e^{{- H}/r}}{\partial\theta^{q - 1}}\frac{\partial e^{{- H}/r}}{\partial\theta}{\prod\limits_{j = 3}^{r}e^{{- H}/r}}} + \ldots}\  \right)} \\ {\leq {\lim\limits_{r\rightarrow\infty}\ {\left. {{\left( \frac{{\partial H}/r}{\partial\theta} \right.^{q}\  \cdot r^{q}} + {O\left( \frac{1}{r} \right)}} \right){e^{- H}}}}} \\ {{= {{\frac{\partial H}{\partial\theta}}^{q}{e^{- H}}}},} \end{matrix} & (129) \end{matrix}$

where the last step follows from the r^(q) terms and that the error introduced by the commutations above will be of O(1/r). Observing that ∂_(θ) _(i) H=∂_(θ) _(i) Σ_(j)θ_(j)H_(j)=H_(i) and assuming that λ_(max) is the largest singular eigenvalue of H, this can be bounded by λ_(max) ^(q)∥e^(−H)∥.

$\begin{matrix} \begin{matrix} {{\frac{{\partial^{p}T}{r_{h}\left\lbrack e^{- H} \right\rbrack}}{\partial\theta^{p}}} \leq {{\dim \left( H_{h} \right)}{\frac{\partial^{q}e^{- H}}{\partial\theta^{q}}}}} \\ {\leq {\lambda_{\max}^{p}{\dim \left( H_{h} \right)}{{T_{T}{r\left\lbrack e^{- H} \right\rbrack}}}}} \end{matrix} & (130) \\ \begin{matrix} {{\frac{\partial^{\mu + 1 - p}Z^{- 1}}{\partial\theta^{\mu + 1 - p}}} \leq {\frac{\left. {\left( {\mu + 1 - p} \right)!} \middle| \lambda_{\max} \right|^{\mu + 1 - p}}{Z^{\mu + 2 - p}}{{Tr}\left\lbrack e^{- H} \right\rbrack}}} \\ {\leq {\left( \frac{\mu + 1 - p}{eZ} \right)^{\mu + 1 - p}\frac{e}{Z}{\lambda_{\max}}^{\mu + 1 - p}T{r\left\lbrack e^{- H} \right\rbrack}}} \\ {= {\left( \frac{\mu + 1 - p}{eZ} \right)^{\mu + 1 - p}e{\lambda_{\max}}^{\mu + 1 - p}}} \end{matrix} & (131) \end{matrix}$

We can therefore find a bound for Eq. 127 as

$\begin{matrix} {{\frac{\partial^{\mu + 1}\sigma_{\upsilon}}{\partial\theta^{\mu + 1}}} \leq {e2^{\mu + 1 + n_{h}}\lambda_{\max}^{\mu + 1}{{{Tr}_{h}\left\lbrack e^{- H} \right\rbrack}}{\max\limits_{p}{\left( \frac{\mu + 1 - p}{eZ} \right)^{\mu + 1 - p}.}}}} & (132) \end{matrix}$

Plugging this result into the bound from above yields

$\begin{matrix} {{{{\partial_{\theta}\left\lbrack {{\log_{K_{1},M_{1}}\sigma_{\upsilon}} - {\log_{K_{1},M_{1}}{\overset{\sim}{\sigma}}_{\upsilon}}} \right\rbrack}} \leq {{\frac{M_{1}{a}_{1}\pi}{2}\left( {e2^{\mu + 1 + n_{h}}\lambda_{\max}^{\mu + 1}{{{Tr}_{h}\left\lbrack e^{- H} \right\rbrack}}{\max\limits_{p}{\left( \frac{\mu + 1 - p}{eZ} \right)^{\mu + 1 - p}\left( \frac{\Delta}{\mu - 1} \right)^{\mu}\frac{1}{\mu + 1}}}} \right)} + {\frac{M_{1}{a}_{1}\pi}{2}\left( {\mu {{\mathcal{L}_{\mu,j}^{\prime}\left( \theta_{j} \right)}}_{\infty}\epsilon_{2}} \right)}}},} & (133) \end{matrix}$

Note that under the reasonable assumption that 2≤μ<<Z, the maximum is achieved for p=μ+1, and hence the upper bound is

$\begin{matrix} {{{\frac{M_{1}{a}_{1}\pi}{2}\left( {{2^{n_{h}}{e\left( {2{\lambda_{\max}}} \right)}^{\mu + 1}{{{Tr}_{h}\left\lbrack e^{- H} \right\rbrack}}\left( \frac{\Delta}{\mu - 1} \right)^{\mu}\frac{1}{\mu + 1}} + {\mu {{\mathcal{L}_{\mu,j}^{\prime}\left( \theta_{j} \right)}}_{\infty}\epsilon_{2}}} \right)} \leq {\frac{M_{1}{a}_{1}\pi}{2}\left( {{2^{n_{h}}{e\left( {2\left. \lambda_{\max} \right\rceil} \right)}^{\mu + 1}{{{Tr}_{h}\left\lbrack e^{- H} \right\rbrack}}\left( \frac{\Delta}{\mu - 1} \right)^{\mu}} + {\mu {{\mathcal{L}_{\mu,j}^{\prime}\left( \theta_{j} \right)}}_{\infty}\epsilon_{2}}} \right)}},} & (134) \end{matrix}$

and hence a bound is obtained on μ, the grid point number, in order to achieve an error of ϵ/6>0 for the former term, which is given by

$\begin{matrix} {{\mu \geq {\left( {{\lambda_{\max}}\Delta} \right){\exp\left( {W\left( \frac{\log\left( {2^{n_{h}}\frac{\begin{matrix} {6M_{1}{a}_{1}e^{2}{\lambda_{\max}}} \\ {\pi {{{Tr}_{h}\left\lbrack e^{- H} \right\rbrack}}} \end{matrix}}{\epsilon}} \right)}{2\lambda_{\max}\Delta} \right)} \right)}}},} & (135) \end{matrix}$

where W is the Lambert function, also known as product-log function, which generally grows slower than the logarithm in the asymptotic limit. Note that μ can hence be lower bounded by

$\begin{matrix} {{\mu \geq {n_{h} + {\log\left( \frac{\begin{matrix} {6M_{1}{a}_{1}e^{2}{\lambda_{\max}}} \\ {\pi {{{Tr}_{h}\left\lbrack e^{- H} \right\rbrack}}} \end{matrix}}{\epsilon} \right)}}}:={n_{h} + {{\log \left( \frac{M_{1}\Lambda}{\epsilon} \right)}.}}} & (136) \end{matrix}$

For convenience, ϵ is chosen such that n_(h)+log(M₁Λ/ϵ) is an integer. This is done simply to avoid having to keep track of ceiling or floor functions in the following discussion where μ=n_(h)+log(M₁Λ/ϵ) is chosen.

For the second part, the derivative of the Lagrangian interpolation polynomial will be bounded. First, note that

${\mathcal{L}_{\mu,j}^{\prime}(\theta)} = {{\Sigma_{{l = 0};{l \neq j}}^{\mu}\left( {\Pi_{{{k = 0};{k \neq j}},l}\frac{\theta - \theta_{k}}{\theta_{j} - \theta_{k}}} \right)}\frac{1}{\theta_{j} - \theta_{l}}}$

for a chosen discretization of the space such that θ_(k)−θ_(j)=(k−j)Δ/μ can be bound by using a central difference formula, such that an uneven number of points may be used (i.e., one takes μ=2κ+1 for positive integer κ) and chooses the point m at which one evaluates the gradient as the central point of the mesh. Note that in this case, for μ≥5 and θ_(m) being the parameters at the midpoint of the stencil,

$\begin{matrix} {{{{\mathcal{L}_{\mu,j}^{\prime}}_{\infty} \leq {\sum\limits_{l \neq j}{\prod\limits_{{k \neq j},l}{\frac{{\theta_{m} - \theta_{k}}}{{\theta_{j} - \theta_{k}}}\frac{1}{{\theta_{l} - \theta_{j}}}}}} \leq {\frac{\left( {\kappa!} \right)^{2}}{\left( {\kappa!} \right)^{2}}\frac{\mu}{\Delta}{\sum\limits_{l \neq j}\frac{1}{\left| {l - j} \right|}}} \leq {\frac{2\mu}{\Delta}{\sum\limits_{l = 1}^{\kappa}\frac{1}{l}}} \leq {\frac{2\mu}{\Delta}\left( {1 + {\int_{1}^{\kappa - 1}{\frac{1}{l}d\; }}} \right)}} = {{\frac{2\mu}{\Delta}\left( {1 + {\log \left( {\left( {\mu - 3} \right)\text{/}2} \right)}} \right)} \leq {\frac{5\mu}{\Delta}{\log \left( {\mu \text{/}2} \right)}}}},} & (137) \end{matrix}$

where the last inequality follows from the fact that μ≥5 and 1+ln(5/2)<(5/2) ln(5/2). Now, plugging in the μ from Eq. 148, this error is bound by

$\begin{matrix} {{{{\mathcal{L}_{\mu,j}^{\prime}}_{\infty} \leq {\frac{{5n_{h}} + {5{\log \left( \frac{M_{1}\Lambda}{\epsilon} \right)}}}{\Delta}{\log \left( {{n_{h}\text{/}2} + {{\log \left( \frac{M_{1}\Lambda}{\epsilon} \right)}\text{/}2}} \right)}}} = {\overset{\sim}{O}\left( \frac{n_{h} + {\log \left( \frac{M_{1}\Lambda}{\epsilon} \right)}}{\Delta} \right)}},} & (138) \end{matrix}$

If an upper bound of ϵ/6 is desired for the second term of the error in Eq. 134, then the following is required:

$\begin{matrix} \begin{matrix} {\epsilon_{2} \leq \frac{\epsilon}{15M_{1}{a}_{1}{\pi\mu}{{\mathcal{L}_{\mu,j}^{\prime}(\theta)}}_{\infty}}} \\ {\leq \frac{\epsilon\Delta}{\begin{matrix} {15M_{1}{a}_{1}{\pi \left( {n_{h} + {\log \left( {M_{1}\Lambda \text{/}\epsilon} \right)}} \right)}^{2}} \\ {\log \left( {\left( {n_{h}\text{/}2} \right) + {{\log \left( {M_{1}\Lambda \text{/}\epsilon} \right)}\text{/}2}} \right)} \end{matrix}}} \\ {\leq \frac{\epsilon\Delta}{15M_{1}{a}_{1}\pi \mu^{2}{\log \left( {\left( {\mu - 1} \right)\text{/}2} \right)}}} \end{matrix} & (139) \end{matrix}$

Hence, the approximation error due to the divided differences and Fourier series approximation of or is bounded by ϵ/3 for the above choice of ϵ₂ and μ. This bounds the second term in Eq. 126 by ϵ/3.

Finally, it is necessary to take into account the error ∥∂_(θ)[log_(K) ₁ _(,M) ₁ {tilde over (σ)}_(v)−log_(K) ₁ _(,M) ₁ ⁸{tilde over (σ)}_(v)] which is introduced through the sampling process, i.e., through the finite sample estimate of

_(s)[⋅] here indicated with the superscript s over the logarithm. Note that this error can be bound straightforwardly by Eq. 105. It is necessary only to bound the error introduced via the finite amount of samples we take, which is a well-known procedure. The concrete bounds for the sample error when estimating the expectation value are stated in the following lemma.

Lemma 6.

Let σ_(m) be the sample standard deviation of the random variable

$\begin{matrix} {{{\overset{\sim}{}}_{s \in {\lbrack{0,1}\rbrack}}\left\lbrack {{Tr}\left\lbrack {\rho \; e^{\frac{is\pi m}{2}\sigma_{\upsilon}}e^{\frac{i\; \pi \; m^{\prime}}{2}{\sigma_{\upsilon}{(\theta_{j})}}}e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{\upsilon}}} \right\rbrack} \right\rbrack},} & (140) \end{matrix}$

such that the sample standard deviation is given by

$\sigma_{k} = {\frac{\sigma_{m}}{\sqrt{k}}.}$

Then with probability at least 1−δ_(s), we can obtain an estimate which is within ϵ_(s)σ_(m) of the mean by taking

$k = \frac{4}{\epsilon_{s}^{2}}$

samples for each sample estimate and taking the median of O(log(1/δ_(s))) such samples.

Proof.

From Chebyshev's inequality taking

$k = \frac{4}{\epsilon_{s}^{2}}$

samples implies that with probability of at least p=¾ each of the mean estimates is within 2σ_(k)=ϵ_(s)σ_(m) from the true mean. Therefore, using standard techniques, one takes the median of O(log(1/δ_(s))) such estimates which gives with probability 1−δ_(s) an estimate of the mean with error at most ϵ_(s)σ_(m), which implies that the procedure must be repeated

$O\left( {\frac{1}{\epsilon_{s}^{2}}{\log \left( \frac{1}{\delta_{s}} \right)}} \right)$

times.

It is now possible to bound the error of the sampling step in the final estimate, denoting with ϵ_(s) the sample error, as

$\begin{matrix} {{{\sum\limits_{m = {- M_{1}}}^{M_{1}}{\sum\limits_{m^{\prime} = {- M_{2}}}^{M_{2}}{{\frac{ic_{m}c_{m^{\prime}}m\; \pi}{2}}{\sum\limits_{j = 0}^{\mu}{{{\mathcal{L}_{\mu,j}^{\prime}(\theta)}}\epsilon_{s}\sigma_{m}}}}}} \leq \frac{5{a}_{1}M_{1}\epsilon_{s}\sigma_{m}\pi \mu^{2}{\log \left( \frac{\mu}{2} \right)}}{\Delta} \leq \frac{\epsilon}{3}},} & (141) \end{matrix}$

Hence, for

$\begin{matrix} \begin{matrix} {\epsilon_{s} \leq \frac{\epsilon \; \Delta}{15{a}_{1}M_{1}\sigma_{m}\pi \mu^{2}{\log \left( \frac{\mu}{2} \right)}}} \\ {\leq \frac{\epsilon \; \Delta}{\begin{matrix} {15M_{1}{a}_{1}\sigma_{m}{\pi \left( {n_{h} + {\log \left( {M_{1}\Lambda \text{/}\epsilon} \right)}} \right)}^{2}} \\ {\log \left( {\left( {n_{n}\text{/}2} \right) + {{\log \left( {M_{1}\Lambda \text{/}\epsilon} \right)}\text{/}2}} \right)} \end{matrix}}} \\ {\leq \frac{\epsilon \; \Delta}{15M_{1}{a}_{1}\sigma_{m}{\pi\mu}^{2}{\log \left( {\mu \text{/}2} \right)}}} \end{matrix} & (142) \end{matrix}$

also the last term in Eq. 106 can be bounded by ϵ/3, which together results in an overall error of ϵ for the various approximation steps, which concludes the proof.

Notably all quantities which occur in these bounds are only polynomial in the number of the qubits. The lower bounds for the choice of parameters are summarized in the following inequalities (Eq. 143-150).

$\begin{matrix} {M_{1} \geq \sqrt{L\; {\log\left( \frac{9{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}K_{1}L\pi}{2\epsilon} \right)}}} & (143) \\ {M_{2} \geq \left\lceil {{\log \left( \frac{4}{\epsilon_{2}} \right)}\sqrt{\left( {2\log \delta_{u}^{- 1}} \right)^{- 1}}} \right\rceil} & (144) \\ {K_{1} \geq {{\log \left( {\left( {1 - {\sigma_{\upsilon}}} \right){\epsilon/9}} \right)}\text{/}{\log \left( {\sigma_{\upsilon}} \right)}}} & (145) \\ {K_{2} \geq \frac{\log \left( {4\text{/}\epsilon_{2}} \right)}{\log \left( \delta_{u}^{- 1} \right)}} & (146) \\ {L \geq \frac{\log\left( \frac{\epsilon}{9\pi K_{1}{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}} \right)}{\log \left( {{\sigma_{\upsilon}}\pi} \right)}} & (147) \\ {{\mu \geq {n_{h} + {\log\left( \frac{\begin{matrix} {6M_{1}{a}_{1}e^{2}{\lambda_{\max}}} \\ {\pi {{{Tr}_{h}\left\lbrack e^{- H} \right\rbrack}}} \end{matrix}}{\epsilon} \right)}}}:={n_{h} + {\log \left( {M_{1}{\Lambda/\epsilon}} \right)}}} & (148) \\ \begin{matrix} {\epsilon_{2} \leq \frac{\epsilon \; \Delta}{\begin{matrix} {15M_{1}{a}_{1}{\pi \left( {n_{h} + {\log \left( {M_{1}{\Lambda/\epsilon}} \right)}} \right)}^{2}} \\ {\log \left( {\left( {n_{h}\text{/}2} \right) + {{\log \left( {M_{1}\Lambda \text{/}\epsilon} \right)}\text{/}2}} \right)} \end{matrix}}} \\ {\leq \frac{\epsilon \; \Delta}{15M_{1}{a}_{1}\pi \mu^{2}{\log \left( {\left( {\mu - 1} \right)\text{/}2} \right)}}} \end{matrix} & (149) \\ \begin{matrix} {\epsilon_{s} \leq \frac{\epsilon \; \Delta}{\begin{matrix} {15M_{1}{a}_{1}\sigma_{m}{\pi \left( {n_{h} + {\log \left( {M_{1}\Lambda \text{/}\epsilon} \right)}} \right)}^{2}} \\ {\log \left( {\left( {n_{n}\text{/}2} \right) + {{\log \left( {M_{1}\Lambda \text{/}\epsilon} \right)}\text{/}2}} \right)} \end{matrix}}} \\ {\leq \frac{\epsilon \; \Delta}{15M_{1}{a}_{1}\sigma_{m}\pi \mu^{2}{\log \left( {\mu \text{/}2} \right)}}} \end{matrix} & (150) \end{matrix}$

Operationalising.

In the following two established subroutines are used, namely sample based Hamiltonian simulation (aka the LMR protocol) [5], as well as the Hadamard test, in order to evaluate the gradient approximation as defined in Eq. 105. In order to hence derive the query complexity for this algorithm, it is necessary only to multiply the cost of the number of factors we need to evaluate with the query complexity of these routines. For this the following result is relied upon.

Theorem 3.

Sample based Hamiltonian simulation [6]] Let 0≤ϵ_(h)≤⅙ be an error parameter and let ρ be a density for which multiple copies are obtained through queries to a oracle O_(ρ). It is now possible to simulate the time evolution e^(−iρt) up to error ϵ_(h) in trace norm as long as ϵ_(h)/t≤1/(6π) with Θ(t²/ϵ_(h)) copies of ρ and hence queries to O_(ρ).

Needed particularly are terms of the form

$\begin{matrix} {{Tr}\left\lbrack {\rho \; e^{\frac{{is}\; \pi \; m}{2}\sigma_{\upsilon}}e^{\frac{i\; \pi \; m^{\prime}}{2}{\sigma_{\upsilon}{(\theta_{j})}}}e^{\frac{{i{({1 - s})}}\pi \; m}{2}\sigma_{\upsilon}}} \right\rbrack} & (151) \end{matrix}$

Note that it is possible to simulate every term in the trace (except ρ) via the sample based Hamiltonian simulation approach to error ϵ_(h) in trace norm. This will introduce a additional error which must be taken into account for the analysis. Let Ũ_(i), i∈{1,2,3} be the unitaries such that ∥U_(i)−Ũ_(i)∥_(*)≤ϵ_(h) where the U_(i) are corresponding to the factors in Eq. 151, i.e.,

${U_{1}:=e^{\frac{is\pi m}{2}\sigma_{\upsilon}}},{U_{2}:=e^{\frac{i\; \pi \; m^{\prime}}{2}{\sigma_{\upsilon}{(\theta_{j})}}}},{{{and}\mspace{14mu} U_{3}}:={e^{\frac{{i{({1 - s})}}\pi m}{2}\sigma_{\upsilon}}.}}$

The error is now bounded as follows. First note that ∥Ũ_(i)∥≤∥Ũ_(i)−U_(i)∥+∥U_(i)∥≤1+ϵ_(h), using Theorem 3 and the fact that the spectral norm is upper bounded by the trace norm.

Tr[ρU ₁ U ₂ U ₃]−Tr[ρŨ ₁ Ũ ₂ Ũ ₃]=Tr[ρU ₁ U ₂ U ₃ −ρŨ ₁ Ũ ₂ Ũ ₃]≤∥U ₁ U ₂ U ₃ −Ũ ₁ Ũ ₂ Ũ ₃ ∥≤∥U ₁ −Ũ ₁ ∥∥Ũ ₂ ∥∥Ũ ₃ ∥+∥U ₂ −Ũ ₂ ∥Ũ ₃ +∥U ₃ −Ũ ₃ ∥≤∥U ₁ −Ũ ₁∥_(*)(1+ϵ_(h))² +∥U ₂ −Ũ ₂∥_(*)(1+ϵ_(h))+∥U ₃ −Ũ ₃∥_(*)≤ϵ_(h)(1+ϵ_(h))²+ϵ_(h)(1+ϵ_(h))+ϵ_(h) =O(ϵ_(h)),  (152)

neglecting higher orders of ϵ_(h), and where in the first step the Von-Neumann trace inequality is applied, as well as the fact that ρ is Hermitian. In the last step, moreover, the results of Theorem 3 are used.

One now requires O((max{M₁,M₂}π)²/ϵ_(h)) queries to the oracles for σ_(v) for the evaluation of each term in the multi sum in Eq. 105. Note that the Hadamard test has a query cost of O(1). In order to hence achieve an overall error of ϵ in the gradient estimation, the error introduced by the sample based Hamiltonian simulation must also be of O(e). Therefore, it is required that

${\epsilon_{h} \leq {O\left( \frac{\epsilon \; \Delta}{5{a}_{1}M_{1}\pi \mu^{2}{\log \left( {\mu \text{/}2} \right)}} \right)}},$

similar to the sample based error which yield the query complexity of

$\begin{matrix} {O\left( \frac{\max \left\{ {M_{1},M_{2}} \right\}^{2}{a}_{1}M_{1}\pi^{3}\mu^{2}{\log \left( {\mu \text{/}2} \right)}}{\epsilon \; \Delta} \right)} & (153) \end{matrix}$

Adjusting the constants gives then the required bound of ϵ of the total error and the query complexity for the algorithm to the Gibbs state preparation procedure is consequentially given by the number of terms in Eq. 105 times the query complexity for the individual term, yielding

$\begin{matrix} {{O\left( \frac{M_{1}^{2}M_{2}\max \left\{ {M_{1},M_{2}} \right\}^{2}{a〛}_{1}\sigma_{m}\pi^{3}\mu^{3}{\log \left( \frac{\mu}{2} \right)}}{\epsilon \; \epsilon_{s}^{2}\Delta} \right)},} & (154) \end{matrix}$

and classical precomputation polynomial in M₁, M₂, K₁, L, s, Δ, μ, where the different quantities are defined in Eq. 143-150. Taking into account the query complexity of the individual steps then results in the following theorem.

Theorem 4.

Let ρ, σ_(v) being two density matrices, ∥σ_(v)∥<1/π, and we have access to an oracle O_(H) for the d-sparse Hamiltonian H(θ) and an oracle O_(ρ) which returns copies of purified density matrix of the data ρ, and ϵ∈(0, ⅙) an error parameter. With probability at least ⅔ an estimate

of the gradient w.r.t. θ∈

^(D) of the relative entropy ∇_(θ)Tr [ρ log σ_(v)] is obtained, such that

∥∇_(θ)Tr[ρ log σ_(v)]−

∥_(max)≤ϵ,  (155)

with

$\begin{matrix} {{\overset{\sim}{O}\left( {\sqrt{\frac{N}{z}}\frac{D{{H(\theta)}}d\; \mu^{5}\gamma}{\epsilon^{3}}} \right)},} & (156) \end{matrix}$

queries to O_(H) and O_(ρ), where μ∈O(n_(h)+log(1/ϵ)), ∥∂_(θ)σ_(v)∥≤e^(γ), ∥σ_(v)∥≥2^(−n) ^(v) for n_(v) being the number of visible units and n_(h) being the number of hidden units, and

Õ(poly(γ,n _(v) ,n _(h),log(1/ϵ)))  (157)

classical precomputation.

Proof.

The runtime follows straightforwardly by using the bounds derived in Eq. 153 and Lemma 5, and by using the bounds for the parameters M₁, M₂, K₁, L, μ, Δ, s given in Eq. (143-150). For the success probability for estimating the whole gradient with dimensionality d, it is now possible to again make use of the boosting scheme used in Eq. 61 to be

$\begin{matrix} {{\overset{\sim}{O}\left( {\frac{d{a}_{1}^{3}\sigma_{m}^{3}\mu^{5}{\log^{3}\left( {\mu \text{/}2} \right)}{poly}\; \log^{({\frac{\frac{{\partial\sigma}v}{\partial\theta}}{\epsilon},\frac{n_{h}^{2}{a}_{1}\sigma_{m}}{\epsilon \; \Delta}})}}{\epsilon^{3}\Delta^{3}}{\log (d)}} \right)},} & (158) \end{matrix}$

where μ=n_(h)+log(M₁Λ/ϵ).

Next it is necessary to take into account the errors from the Gibbs state preparation given in Lemma 3. For this note that the error between the perfect Hamiltonian simulation of σ_(v) and the sample based Hamiltonian simulation with an erroneous density matrix denoted by Ũ, i.e., including the error from the Gibbs state preparation procedure, is given by

∥Ũ−e ^(−iσ) ^(v) ^(t) ∥≤∥Ũ−e ^(−i{tilde over (σ)}) ^(v) ^(t) ∥+∥e ^(−i{tilde over (σ)}) ^(v) ^(t) −e ^(−iσ) ^(v) ^(t)∥≤ϵ_(h)+ϵ_(G) t  (159)

where ϵ_(h) is the error of the sample based Hamiltonian simulation, which holds since the trace norm is an upper bound for the spectral norm, and ∥σ_(v)−{tilde over (σ)}_(v)∥≤ϵ_(G) is the error for the Gibbs state preparation from Theorem 1 for a d-sparse Hamiltonian, for a cost

$\begin{matrix} {{\overset{\sim}{}\left( {\sqrt{\frac{N}{z}}{H}d\; {\log \left( \frac{H}{\epsilon_{G}} \right)}{\log \left( \frac{1}{\epsilon_{G}} \right)}} \right)}.} & (160) \end{matrix}$

From Eq. 152 it is known that the error ϵ_(h) propagates nearly linear, and hence it suffices for us to take ϵ_(G)≤ϵ_(h)/t where t=O(max{M₁, M₂}) and adjust the constants ϵ_(h)−ϵ_(h)/2 in order to achieve the same precision ϵ in the final result. It is now required that

$\begin{matrix} {\overset{\sim}{}\left( {\sqrt{\frac{N}{z}}{{H(\theta)}}{\log\left( \frac{\begin{matrix} {{H(\theta)}} \\ {\max \left\{ {M_{1},M_{2}} \right\}} \end{matrix}}{\epsilon_{h}} \right)}{\log \left( \frac{\max \left\{ {M_{1},M_{2}} \right\}}{\epsilon_{h}} \right)}} \right)} & (161) \end{matrix}$

and using the ϵ_(h) from before we hence find that this s bound by

$\begin{matrix} {\overset{\sim}{}\left( {\sqrt{\frac{N}{z}}{{H(\theta)}}{\log\left( \frac{{{H(\theta)}}n_{h}^{2}}{\epsilon\Delta} \right)}{\log\left( \frac{n_{h}^{2}}{\epsilon\Delta} \right)}} \right)} & (162) \end{matrix}$

query complexity to the oracle of H for the Gibbs state preparation.

The procedure succeeds with probability at least 1−δ_(s) for a single repetition for each entry of the gradient. In order to have a failure probability of the final algorithm of less than ⅓, it is necessary to repeat the procedure for all D dimensions of the gradient and take for each the median over a number of samples. Let n_(f) be as previously the number of instances of the one component of the gradient such that the error is larger than ϵ_(s)σ_(m) and n_(s) be the number of instances with an error ≤ϵ_(s)σ_(m), and the result taken is the median of the estimates, where n=n_(s)+n_(f) samples are collected. The algorithm gives a wrong answer for each dimension if

${n_{s} \leq \left\lfloor \frac{n}{2} \right\rfloor},$

since then the median is a sample such that the error is larger than ϵ_(s)σ_(m). Let p=1−δ_(s) be the success probability to draw a positive sample, as is the case of the algorithm. Since each instance of(recall that each sample here consists of a number of samples itself) from the algorithm will independently return an estimate for the entry of the gradient, the total failure probability is bounded by the union bound, i.e.,

$\begin{matrix} {{{Pr_{fail}} \leq {D \cdot {\Pr \left\lbrack {n_{s} \leq \left\lfloor \frac{n}{2} \right\rfloor} \right\rbrack}} \leq {D \cdot {e^{{- \frac{n}{2{({1 - \delta_{s}})}}}{({{({1 - \delta_{s}})} - \frac{1}{2}})}}}^{2}} \leq \frac{1}{3}},} & (163) \end{matrix}$

which follows from the Chernoff inequality for a binomial variable with 1−δ_(s)>½, which is given in the present case for a proper choice of δ_(s)<½. Therefore, by taking

${{n \geq {\frac{2 - {2\delta_{s}}}{\left( {{1\text{/}2} - \delta_{s}} \right)^{2}}{\log \left( {3D} \right)}}} = {O\left( {\log \left( {3D} \right)} \right)}},$

a total failure probability of at least ⅓ is achieved for a constant, fixed δ_(s). Note that this for a constant, fixed. Note that this hence results in an multiplicative factor of O(log(D)) in the query complexity of Eq. 156. The total query complexity to the oracle O_(ρ) for a purified density matrix of the data ρ and the Hamiltonian oracle O_(H) is then given by

$\begin{matrix} {{\overset{\sim}{O}\left( {\sqrt{\frac{N}{z}}\frac{\begin{pmatrix} {d{\log (d)}{{H(\theta)}}{a}_{1}^{3}\sigma_{m}^{3}\mu^{5}{\log^{3}\left( {\mu \text{/}2} \right)}} \\ {{{poly}\mspace{11mu} \log \frac{\frac{\partial\sigma_{\upsilon}}{\partial\theta}}{\epsilon}},\frac{n_{h}^{2}{a}_{1}\sigma_{m}}{\epsilon \; \Delta},{{H(\theta)}}} \end{pmatrix}}{\epsilon^{3}\Delta^{3}}} \right)},} & (164) \end{matrix}$

which reduces to

$\begin{matrix} {{\overset{\sim}{O}\left( {\sqrt{\frac{N}{z}}\frac{D{{H(\theta)}}d\; \mu^{5}\alpha}{\epsilon^{3}}} \right)},} & (165) \end{matrix}$

hiding the logarithmic factors in the Õ notation.

Returning once again to the drawings, FIG. 7 illustrates an example method 60B to estimate the gradient of the quantum relative entropy of a restricted or non-restricted QBM having visible and hidden nodes. Method 60B may be employed as a particular instance of step 60 in training method of FIG. 5. Each step of this method is developed in detail in the description above; accordingly, the present description provides only summary detail to enable the reader to understand the process flow in one non-limiting example.

In this method, the gradient of the quantum relative entropy is estimated based on one or more high-order divided-difference formulas. At 110 of method 60B, truncated Fourier-series expansions of log(x) and x are computed. At 112 an interpolation polynomial L′(θ) is computed to represent a derivative that appears in the gradient of the quantum relative entropy. At 114 the density operator ρ is computed.

At 116 of method 60B, a control loop is encountered wherein an ancilla qubit is prepared in the state |+

⊗ρ=1/√{square root over (2)}(|0

+|1

)⊗ρ. At 120, a sample-based Hamiltonian simulation is applied to provide a majorised distribution σ_(v) over the one or more visible nodes at fixed s. Then, at 122, a Hadamard gate is applied. At 124, the amplitude of the ancilla qubit state is estimated with the |0

state of the ancilla qubit marked. Accordingly, estimating based on the one or more high-order divided-difference formulas in method 60B includes using the quantum computer to compute one or more divided differences of a training objective function of the QBM using Fourier-series methods.

At each iteration through the control loop, the confidence analysis described above is applied in order to determine whether additional measurements are required to achieve precision ϵ. If so, execution returns to 116. Otherwise, the product of the expectation values and the Fourier and L′(θ) coefficients is evaluated and returned.

Appendix for Approach 1 and Approach 2

Bounds on the Gradient.

Starting with some preliminary equations needed in order to obtain a useful bound on the gradient,

Let A(θ) be a linear operator which depends linearly on the density matrix σ. Then

$\begin{matrix} {{{\frac{\partial}{\partial\theta}{A(\theta)}^{- 1}} = {{- A^{- 1}}\frac{\partial\sigma}{\partial\theta}A^{- 1}}}.} & (166) \end{matrix}$

Proof.

The proof follows straightforwardly by using the identity I.

${\frac{\partial I}{\partial\theta} = {0 = {{{\frac{\partial}{\partial\theta}A}A^{- 1}} = {{\left( \frac{\partial A}{\partial\theta} \right)A^{- 1}} + {A\left( \frac{\partial A^{- 1}}{\partial\theta} \right)}}}}}.$

Reordering the terms completes the proof. This can equally be proven using the Gateau derivative.

In the following one will furthermore rely on the following well-known inequality.

Lemma 7.

Von Neumann Trace Inequality. Let A∈

^(n×n) and B∈

^(n×n) with singular values {σ_(i)(A)}_(i=1) ^(n) and {σ_(i)(B)}_(i=1) ^(n) respectively such that σ_(i)(⋅)≤σ_(j)(⋅) if i≤j. It then holds that

$\begin{matrix} {{{{Tr}\lbrack{AB}\rbrack}} \leq {\sum\limits_{i = 1}^{n}{{\sigma (A)}_{i}{{\sigma (B)}_{i}.}}}} & (167) \end{matrix}$

Note that from this, one immediately obtains

$\begin{matrix} {{{{{Tr}\lbrack{AB}\rbrack}} \leq {\sum\limits_{i = 1}^{n}{{\sigma (A)}_{i}{\sigma (B)}_{i}}} \leq {{\sigma_{\max}(B)}{\sum\limits_{i}{\sigma (A)}_{i}}}} = {{B}{\sum\limits_{i}{{\sigma (A)}_{i}.}}}} & (168) \end{matrix}$

This is particularly useful if A is Hermitian and PSD, since this implies |Tr [AB]|≤∥B∥ Tr [A] for Hermitian A.

Since dealing with operators, the common chain rule of differentiation does not hold generally. Indeed the chain rule is a special case if the derivative of the operator commutes with the operator itself. Since a term of the form log σ(θ) is encountered, one cannot assume that [σ, σ′]=0, where σ′:=σ⁽¹⁾ is the derivative w.r.t., θ. For this case the following identity is needed, similar to Duhamels formula, in the derivation of the gradient for the purely-visible-units Boltzmann machine.

Lemma 8. Derivative of Matrix Logarithm [7]

$\begin{matrix} {{\frac{d}{dt}\log {A(t)}} = {\int\limits_{0}^{1}{\left\lbrack {{sA} + {\left( {1 - s} \right)I}} \right\rbrack^{- 1}{{\left( \frac{dA}{dt} \right)\left\lbrack {{sA} + {\left( {1 - s} \right)I}} \right\rbrack}^{- 1}.}}}} & (169) \end{matrix}$

For completeness a proof of the above identity is now included.

Proof.

The integral definition of the logarithm [8] is used for a complex, invertible, n×n matrix A=A(t) with no real negative

log A=(A−I)∫₀ ¹[s(A−I)+I]⁻¹.  (170)

From this is obtained the derivative

$\begin{matrix} {{\frac{d}{dt}\log \; A} = {\frac{dA}{dt}{\int_{0}^{1}{d{s\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}\left( {A - I} \right){\int_{0}^{1}{ds{{\frac{d}{dt}\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}.}}}}}}} & (171) \end{matrix}$

Applying Eq. 166 to the second term on the right hand side yields

$\begin{matrix} {{{\frac{d}{dt}\log \; A} = {{\frac{dA}{dt}{\int_{0}^{1}{d{s\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}}}} + {\left( {A - I} \right){\int_{0}^{1}{d{s\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}s{\frac{dA}{dt}\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}}}}}},} & (172) \end{matrix}$

which can be rewritten as

$\begin{matrix} {{\frac{d}{dt}\log \; A} = {\int_{0}^{1}{d{{s\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}{\frac{dA}{dt}\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}}}} & (173) \\ {\mspace{79mu} {{{+ \left( {A - I} \right)}{\int_{0}^{1}{d{s\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}s{\frac{dA}{dt}\left\lbrack {{s\left( {A - I} \right)} + I} \right\rbrack}^{- 1}}}},}} & (174) \end{matrix}$

by adding the identity I=[s(A−I)+I][s(A−I)+I]⁻¹ in the first integral and reordering commuting terms (i.e., s). Notice that one can hence just subtract the first two terms in the integral which yields Eq. 169 as desired.

Amplitude Estimation.

The well known amplitude estimation algorithm can be performed via the following steps.

-   -   1. Initialize two registers of appropriate sizes to the state |0         |0         , where         is a unitary transformation which prepares the input state,         i.e., |ψ         =         |0     -   2. Apply the quantum Fourier transform

${QFT}_{N}:\left. {x\rangle}\rightarrow{\frac{1}{\sqrt{N}}\Sigma_{y = 0}^{N - 1}e^{2\pi ix{y/N}}{y\rangle}} \right.$

for 0≤x<N, to the first register.

-   -   3. Apply the controlled-Q operator to the second register, i.e.,         let Λ_(N)(U):|j         |y         →|j         (U^(j)|y         ) for 0≤j<N), then we apply λ_(N)(Q) where Q:=−         S₀         ^(†)S_(t) is the Grover's operator, S₀ changes the sign of the         amplitude if and only if the state is the zero state |0         , and St is the sign-flip operator for the target state, i.e.,         if |x         is the desired outcome, then S_(t):=I−2|x         x|.     -   4. Apply QFT_(N) ^(†) to the first register.     -   5. Output

$\overset{\sim}{a} = {{\sin^{2}\left( {\pi \frac{\overset{\sim}{\theta}}{N}} \right)}.}$

The algorithm can hence be summarized as the unitary transformation

((QFT†⊗I)Λ_(N)(Q)(QFT _(N) ⊗I))  (175)

applied to the state |0

|0

, followed by a measurement of the first register and classical post-processing returns an estimate {tilde over (θ)} of the amplitude of the desired outcome such that |θ−{tilde over (θ)}|≤ϵ with probability at least 8/π². The result is summarized in the following theorem, which states a slightly more general version.

Theorem 5.

Amplitude Estimation [9] For any positive integer k, the Amplitude Estimation Algorithm returns an estimate ã (0≤ã≤1) such that

$\begin{matrix} {{{\overset{\sim}{a} - a}} \leq {{2{\pi k}\frac{\sqrt{a\left( {1 - a} \right)}}{N}} + {k^{2}\frac{\pi^{2}}{N^{2}}}}} & (176) \end{matrix}$

with probability at least

$\frac{8}{\pi^{2}} \approx {{0.8}1}$

for k=1 and with probability greater than

$1 - \frac{1}{2\left( {k - 1} \right)}$

for k≥2. If a=0 then ã=0 with certainty, and if a=1 and N is even, then ã=1 with certainty.

Notice that the amplitude θ can hence be recovered via the relation θ=arcsin √{square root over (θ_(a))} as described above which incurs an ϵ-error for θ (c.f., Lemma 7, [9]).

CONCLUSION AND CLAIMS

One aspect of this disclosure is directed to a method to train a QBM having one or more visible nodes and one or more hidden nodes. The method comprises associating each visible and each hidden node of the QBM to a different corresponding qubit of a plurality of qubits of a quantum computer, wherein a state of each of the plurality of qubits contributes to a global energy of the QBM according to a set of weighting factors, and wherein the plurality of qubits include one or more output qubits corresponding to one or more visible nodes of the QBM. The method further comprises providing a distribution of training data over the one or more output qubits, estimating a gradient of a quantum relative entropy between the one or more output qubits and the distribution of training data, and training the set of weighting factors based on the estimated gradient, using the quantum relative entropy as a cost function.

In some implementations, the quantum relative entropy S is defined by S(ρ|σ_(v))=Tr (ρ log ρ)−Tr (ρ log σ_(v)), wherein S is a function of density operator ρ conditioned on a majorised distribution σ_(v) over the one or more visible nodes, and wherein Tr is a trace of an operator. In some implementations, the QBM is a restricted QBM, in which every Hamiltonian operator acting on a qubit corresponding to a hidden node of the QBM commutes with every other Hamiltonian operator acting on a qubit corresponding to a hidden node of the QBM. In some implementations, estimating the gradient includes computing a variational upper bound on the quantum relative entropy. In some implementations, estimating the gradient includes using substantially commuting operators to assign an energy penalty to each qubit corresponding to a hidden node of the QBM. In some implementations, estimating the gradient includes preparing a purified Gibbs state in the plurality of qubits based on one or more Hamiltonians. In some implementations, estimating the gradient of the quantum relative entropy includes estimating based on one or more high-order divided-difference formulas. In some implementations, estimating based on the one or more high-order divided-difference formulas includes using the quantum computer to compute one or more divided differences of a training objective function of the QBM using Fourier-series methods. In some implementations, using the quantum computer to compute the one or more divided differences includes using the quantum computer to compute one or more truncated Fourier-series expansions. In some implementations, estimating the gradient of the quantum relative entropy includes computing an interpolation polynomial to represent a derivative appearing in the gradient. In some implementations, estimating the gradient of the quantum relative entropy includes applying a sample-based Hamiltonian simulation to provide a distribution σ_(v) over the one or more visible nodes.

Another aspect of this disclosure is directed to a quantum computer comprising a register including a plurality of qubits, a modulator configured to implement one or more quantum-logic operations on the plurality of qubits, a demodulator configured to output data based on a quantum state of the plurality of qubits, a controller operatively coupled to the modulator and to the demodulator, and computer memory associated with the controller. The computer memory holds instructions that cause the controller to instantiate a QBM having one or more visible nodes and one or more hidden nodes, wherein each visible and each hidden node corresponds to a different qubit of the plurality of qubits, wherein a state of each of the plurality of qubits contributes to a global energy of the QBM according to a set of weighting factors, wherein the plurality of qubits include one or more output qubits corresponding to one or more visible nodes of the QBM, and wherein the weighting factors are trained using a distribution of training data over the one or more output qubits, based on a previously estimated gradient of a quantum relative entropy between the one or more output qubits and the distribution of training data, using the quantum relative entropy as a cost function.

In some implementations, the instructions cause the controller to estimate the gradient of the quantum relative entropy and to train the set of weighting factors based on the estimated gradient, using the quantum relative entropy as a cost function.

Another aspect of this disclosure is directed to a quantum computer comprising a register including a plurality of qubits, a modulator configured to implement one or more quantum-logic operations on the plurality of qubits, a demodulator configured to output data based on a quantum state of the plurality of qubits, a controller operatively coupled to the modulator and to the demodulator, and computer memory associated with the controller. The computer memory holds instructions that cause the controller to instantiate a QBM having one or more visible nodes and one or more hidden nodes, wherein each visible and each hidden node corresponds to a different qubit of the plurality of qubits, wherein a state of each of the plurality of qubits contributes to a global energy of the QBM according to a set of weighting factors, and wherein the plurality of qubits include one or more output qubits corresponding to one or more visible nodes of the QBM. The instructions further cause the controller to provide a distribution of training data over the one or more output qubits, estimate a gradient of a quantum relative entropy between the one or more output qubits and the distribution of training data, and train the set of weighting factors based on the estimated gradient, using the quantum relative entropy as a cost function.

In some implementations, the QBM is a restricted QBM, in which every Hamiltonian operator acting on a qubit corresponding to a hidden node of the QBM commutes with every other Hamiltonian operator acting on a qubit corresponding to a hidden node of the QBM, and estimation of the gradient includes computation of a variational upper bound on the quantum relative entropy. In some implementations, estimation of the gradient includes use of substantially commuting operators to assign an energy penalty to each qubit corresponding to a hidden node of the QBM. In some implementations, estimation of the gradient includes preparation of a purified Gibbs state in the plurality of qubits based on one or more Hamiltonians. In some implementations, the gradient of the quantum relative entropy is estimated based on one or more high-order divided-difference formulas, and estimation of the gradient based on the one or more divided-difference formulas includes using the quantum computer to compute one or more divided differences of a training objective function of the QBM using Fourier-series methods. In some implementations, estimation of the gradient includes computation of an interpolation polynomial to represent a derivative appearing in the gradient. In some implementations, estimation of the gradient includes applying a sample-based Hamiltonian simulation to provide a distribution σ_(v) over the one or more visible nodes.

This disclosure is presented by way of example and with reference to the attached drawing figures. Components, process steps, and other elements that may be substantially the same in one or more of the figures are identified coordinately and described with minimal repetition. It will be noted, however, that elements identified coordinately may also differ to some degree. It will be further noted that the figures are schematic and generally not drawn to scale. Rather, the various drawing scales, aspect ratios, and numbers of components shown in the figures may be purposely distorted to make certain features or relationships easier to see.

It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.

The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof. 

1. A method to train a quantum Boltzmann machine (QBM) having one or more visible nodes and one or more hidden nodes, the method comprising: associating each visible and each hidden node of the QBM to a different corresponding qubit of a plurality of qubits of a quantum computer, wherein a state of each of the plurality of qubits contributes to a global energy of the QBM according to a set of weighting factors, and wherein the plurality of qubits include one or more output qubits corresponding to one or more visible nodes of the QBM; providing a distribution of training data over the one or more output qubits; estimating a gradient of a quantum relative entropy between the one or more output qubits and the distribution of training data; and training the set of weighting factors based on the estimated gradient, using the quantum relative entropy as a cost function.
 2. The method of claim 1 wherein the quantum relative entropy S is defined by S(ρ|σ_(v))=Tr (ρ log ρ)−Tr (ρ log σ_(v)), wherein S is a function of density operator ρ conditioned on a majorised distribution σ_(v) over the one or more visible nodes, and wherein Tr is a trace of an operator.
 3. The method of claim 1 wherein the QBM is a restricted QBM, in which every Hamiltonian operator acting on a qubit corresponding to a hidden node of the QBM commutes with every other Hamiltonian operator acting on a qubit corresponding to a hidden node of the QBM.
 4. The method of claim 3 wherein estimating the gradient includes computing a variational upper bound on the quantum relative entropy.
 5. The method of claim 3 wherein estimating the gradient includes using substantially commuting operators to assign an energy penalty to each qubit corresponding to a hidden node of the QBM.
 6. The method of claim 3 wherein estimating the gradient includes preparing a purified Gibbs state in the plurality of qubits based on one or more Hamiltonians.
 7. The method of claim 1 wherein estimating the gradient of the quantum relative entropy includes estimating based on one or more high-order divided-difference formulas.
 8. The method of claim 7 wherein estimating based on the one or more high-order divided-difference formulas includes using the quantum computer to compute one or more divided differences of a training objective function of the QBM using Fourier-series methods.
 9. The method of claim 8 wherein using the quantum computer to compute the one or more divided differences includes using the quantum computer to compute one or more truncated Fourier-series expansions.
 10. The method of claim 1 wherein estimating the gradient of the quantum relative entropy includes computing an interpolation polynomial to represent a derivative appearing in the gradient.
 11. The method of claim 1 wherein estimating the gradient of the quantum relative entropy includes applying a sample-based Hamiltonian simulation to provide a distribution σ_(v) over the one or more visible nodes.
 12. A quantum computer comprising: a register including a plurality of qubits; a modulator configured to implement one or more quantum-logic operations on the plurality of qubits; a demodulator configured to output data based on a quantum state of the plurality of qubits; a controller operatively coupled to the modulator and to the demodulator; and associated with the controller, computer memory holding instructions that cause the controller to: instantiate a quantum Boltzmann machine (QBM) having one or more visible nodes and one or more hidden nodes, wherein each visible and each hidden node corresponds to a different qubit of the plurality of qubits, wherein a state of each of the plurality of qubits contributes to a global energy of the QBM according to a set of weighting factors, and wherein the plurality of qubits include one or more output qubits corresponding to one or more visible nodes of the QBM, and wherein the weighting factors are trained using a distribution of training data over the one or more output qubits, based on a previously estimated gradient of a quantum relative entropy between the one or more output qubits and the distribution of training data, using the quantum relative entropy as a cost function.
 13. The quantum computer of claim 12 wherein the instructions cause the controller to estimate the gradient of the quantum relative entropy and to train the set of weighting factors based on the estimated gradient, using the quantum relative entropy as a cost function.
 14. A quantum computer comprising: a register including a plurality of qubits; a modulator configured to implement one or more quantum-logic operations on the plurality of qubits; a demodulator configured to reveal data based on a quantum state of the plurality of qubits; a controller operatively coupled to the modulator and to the demodulator; and associated with the controller, computer memory holding the stored control-parameter values and holding instructions that cause the controller to: instantiate a quantum Boltzmann machine (QBM) having one or more visible nodes and one or more hidden nodes, wherein each visible and each hidden node corresponds to a different qubit of the plurality of qubits, wherein a state of each of the plurality of qubits contributes to a global energy of the QBM according to a set of weighting factors, and wherein the plurality of qubits include one or more output qubits corresponding to one or more visible nodes of the QBM, provide a distribution of training data over the one or more output qubits; estimate a gradient of a quantum relative entropy between the one or more output qubits and the distribution of training data, and train the set of weighting factors based on the estimated gradient, using the quantum relative entropy as a cost function.
 15. The quantum computer of claim 14 wherein the QBM is a restricted QBM, in which every Hamiltonian operator acting on a qubit corresponding to a hidden node of the QBM commutes with every other Hamiltonian operator acting on a qubit corresponding to a hidden node of the QBM, and wherein estimation of the gradient includes computation of a variational upper bound on the quantum relative entropy.
 16. The quantum computer of claim 15 wherein estimation of the gradient includes use of substantially commuting operators to assign an energy penalty to each qubit corresponding to a hidden node of the QBM.
 17. The quantum computer of claim 15 wherein estimation of the gradient includes preparation of a purified Gibbs state in the plurality of qubits based on one or more Hamiltonians.
 18. The quantum computer of claim 14 wherein the gradient of the quantum relative entropy is estimated based on one or more high-order divided-difference formulas, and wherein estimation of the gradient based on the one or more divided-difference formulas includes using the quantum computer to compute one or more divided differences of a training objective function of the QBM using Fourier-series methods.
 19. The quantum computer of claim 18 wherein estimation of the gradient includes computation of an interpolation polynomial to represent a derivative appearing in the gradient.
 20. The quantum computer of claim 18 wherein estimation of the gradient includes applying a sample-based Hamiltonian simulation to provide a distribution σ_(v) over the one or more visible nodes. 